Abstract
Attribute alignment aims to discover correspondences between attributes that represent the same concept in heterogeneous knowledge graphs, and it is one of the key technologies for cross-graph knowledge fusion. Existing models usually rely on rule-based and word-embedding-based methods, which suffer from two problems: the similarity measurement is incomplete, and attribute instance information is not fully exploited. To address these problems, this paper proposes an attribute alignment model based on multiple similarity measures. Similarity measures are designed from multiple perspectives to obtain similarity features between attributes, and a machine learning model is used to aggregate these features. In addition, to make full use of attribute instance information, this paper proposes, within the same framework, a representation learning algorithm for attribute instance sets, which encodes each attribute instance set as a vector in order to extract the topic similarity between sets and thereby assist attribute alignment. Experiments on attribute alignment datasets verify the effectiveness of the model and show that the set representation learning algorithm effectively captures the topic features of attribute instances and significantly improves the attribute alignment results.
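The idea of describing an attribute pair by several similarity measures can be illustrated with a minimal Python sketch. It is not the paper's implementation: the three measures (name string similarity, Jaccard overlap of instance sets, cosine similarity of set vectors), the mean-pooling set encoder that stands in for the learned set representation, and all function names and the toy `embed` table are assumptions made for illustration only.

```python
import math
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """String similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(xs: set, ys: set) -> float:
    """Literal overlap of two attribute instance sets."""
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def encode_set(instances, embed):
    """Order-independent set encoding by mean pooling.

    Assumption: a simple stand-in for a learned set encoder.
    Returns None when no instance has a vector in `embed`.
    """
    vecs = [embed[x] for x in instances if x in embed]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v) -> float:
    """Cosine similarity; 0.0 if either vector is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_features(name_a, inst_a, name_b, inst_b, embed):
    """Feature vector for one candidate attribute pair."""
    return [
        name_similarity(name_a, name_b),
        jaccard(set(inst_a), set(inst_b)),
        cosine(encode_set(inst_a, embed), encode_set(inst_b, embed)),
    ]


# Toy, hand-written instance vectors (hypothetical values).
embed = {"Beijing": [1.0, 0.0], "Shanghai": [0.9, 0.1], "1990": [0.0, 1.0]}
features = similarity_features(
    "birthplace", ["Beijing"], "place of birth", ["Shanghai"], embed
)
```

In this toy pair the two instance sets share no literal value, so the Jaccard feature is zero, while the set-vector feature is still high because both sets contain city names. A feature vector of this kind would then be passed to a trained classifier to decide whether the two attributes align; the classifier is omitted here.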
Key words
attribute alignment /
representation learning /
multi-similarity measures /
set encoding