Extracting Knowledge from Web Tables Based on Fast Clustering with Equivalent Compression
WU Xiaolong1,2, CAO Cungen1
1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China
Abstract: Extracting knowledge from Web tables is an important way to obtain high-quality knowledge and is of substantial significance for knowledge graphs, Web mining, and related areas. Classical methods depend on either a well-formed table structure or sufficient pre-existing knowledge. In contrast, we propose a novel method of Web table knowledge extraction for large-scale Web tables, based on fast clustering with equivalent compression. By making full use of the structural characteristics of tables, we group tables with similar structures through unsupervised clustering, and then infer the semantic structure of the similar tables for knowledge extraction. The results show that the proposed clustering algorithm reduces the clustering time for 5,000 tables from 72 hours to 20 minutes at the same level of clustering accuracy. The accuracy of the knowledge triples extracted with table templates after clustering further indicates that the method is effective.
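The abstract does not spell out the algorithm, so the following Python sketch is only one possible reading of "clustering with equivalent compression": tables whose structural signatures are identical are collapsed into a single representative, and similarity is then computed between representatives only. The signature encoding, the `SequenceMatcher` similarity, the greedy threshold clustering, and all function names are illustrative assumptions, not the authors' method.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def signature(table):
    """Flatten a table (rows of cell tags) into a hashable structural signature.

    Assumption: a table's structure is described only by its cell tags,
    with "|" marking the end of each row.
    """
    sig = []
    for row in table:
        sig.extend(row)
        sig.append("|")
    return tuple(sig)


def compress(tables):
    """Group table indices that share exactly the same signature."""
    groups = defaultdict(list)
    for idx, table in enumerate(tables):
        groups[signature(table)].append(idx)
    return groups


def similarity(sig_a, sig_b):
    """Sequence similarity in [0, 1]; a stand-in for a real structural measure."""
    return SequenceMatcher(None, sig_a, sig_b).ratio()


def cluster(tables, threshold=0.7):
    """Greedy threshold clustering over compressed representatives.

    Each distinct signature is compared with existing cluster representatives
    once, no matter how many tables share that signature.
    """
    clusters = []  # list of (representative signature, member indices)
    for sig, members in compress(tables).items():
        for rep, idxs in clusters:
            if similarity(sig, rep) >= threshold:
                idxs.extend(members)
                break
        else:
            clusters.append((sig, list(members)))
    return [sorted(idxs) for _, idxs in clusters]


# Hypothetical toy input: four tables described by their cell tags.
tables = [
    [["th", "th"], ["td", "td"]],
    [["th", "th"], ["td", "td"]],
    [["th", "th"], ["td", "td"], ["td", "td"]],
    [["td"]],
]
print(cluster(tables))
```

In this toy input the first two tables share a signature, so only three representatives are compared instead of four tables. On a corpus where many pages are generated from the same template, a reduction of this kind is what would make pairwise clustering cheaper.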