WU Jiahao, CHEN Bo, HAN Xianpei, SUN Le. Attribute Alignment Based on Multi-Similarity Measure and Set Encoding[J]. Journal of Chinese Information Processing, 2021, 35(4): 35-43.
Attribute Alignment Based on Multi-Similarity Measure and Set Encoding
WU Jiahao1,2, CHEN Bo1, HAN Xianpei1, SUN Le1
1. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China
Abstract: The goal of attribute alignment is to find correspondences between attributes that represent the same concept in heterogeneous knowledge graphs. It is one of the key technologies for knowledge fusion. Existing models based on rules and word embeddings suffer from incomplete similarity measurement and insufficient use of attribute instance information. To address these issues, this paper proposes an attribute alignment model based on multiple similarity measures. We design similarity measures from multiple perspectives and use a machine learning model to aggregate these features. In addition, this paper proposes a representation learning algorithm for attribute instance sets. We encode each attribute instance set as a vector and extract the topic similarity between sets to assist attribute alignment. Experiments demonstrate the effectiveness of the model and show that the set representation learning algorithm can effectively capture the topic features of attribute instances and significantly improve attribute alignment results.
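The two ideas in the abstract, several similarity measures per attribute pair plus an order-invariant encoding of each attribute's instance set, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the three features are illustrative choices, and the mean of hashed character-bigram counts is a simple stand-in for the learned set representation the paper proposes. The resulting feature vector would be passed to a trained classifier, which is omitted here.

```python
import math
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def jaccard(x: set, y: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def encode_set(values: list, dim: int = 64) -> list:
    """Order-invariant set encoding (assumption: a stand-in for a learned
    encoder). Each value contributes hashed character-bigram counts, and
    the counts are averaged over the number of values in the set."""
    vec = [0.0] * dim
    for value in values:
        for i in range(len(value) - 1):
            bucket = sum(ord(c) for c in value[i:i + 2]) % dim
            vec[bucket] += 1.0
    n = max(len(values), 1)
    return [v / n for v in vec]


def cosine(u: list, v: list) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_features(name_a, values_a, name_b, values_b) -> list:
    """Feature vector for one candidate attribute pair."""
    return [
        name_similarity(name_a, name_b),                      # attribute names
        jaccard(set(values_a), set(values_b)),                # exact value overlap
        cosine(encode_set(values_a), encode_set(values_b)),   # set-level similarity
    ]


features = similarity_features(
    "birthplace", ["Beijing", "Shanghai"],
    "place of birth", ["Beijing", "Nanjing"],
)
```

Averaging over the set makes the encoding independent of the order in which instances are listed, which is the property a set encoder needs; a learned encoder would replace the hashed bigram counts with trained representations.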