A Text Similarity Measure Based on Heterogeneous Information Network

MA Qiuwei, ZHAO Shuliang, ZHAO Yan

Journal of Chinese Information Processing ›› 2023, Vol. 37 ›› Issue (9): 108-120.
Information Extraction and Text Mining

Abstract

Text similarity measurement broadly affects text-based classification, clustering, and ranking. Most existing text similarity measures use text features of a single granularity and ignore the structured information contained in unstructured text data. This paper converts the text similarity measurement problem into a node similarity measurement problem in a weighted heterogeneous information network, and uses the structural and semantic properties of meta-paths to measure the explicit semantic similarity of texts, making the results more accurate and more interpretable. First, the text feature granularity is expanded by incorporating a world knowledge base, and a weighted text heterogeneous information network is constructed, so that unstructured text data is represented as a structured heterogeneous information network. Second, meta-paths are mined, and a meta-path-based ω-PageRank-Nibble subgraph partitioning algorithm is proposed to obtain a local graph containing a given set of text nodes. From the local graph, the commuting matrix of each specific meta-path is computed and stored, which reduces the time and space cost of the subsequent similarity measurement. Finally, the AllPathSim coupled similarity measure is proposed to measure the similarity of text-type nodes. For graph pruning, partitioning subgraphs with the meta-path-based ω-PageRank-Nibble algorithm significantly reduces time and space costs compared with processing the whole graph. For similarity measurement, compared with the best contemporaneous method for measuring nodes of the same type, AllPathSim improves the correlation coefficient of the measurement results by 6.1% and 6.9% on the 20NG and GCAT datasets, respectively.
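The commuting-matrix step can be illustrated with a minimal sketch. It does not implement AllPathSim, whose definition is not given on this page. It uses the classic PathSim score over a single symmetric meta-path, Document-Entity-Document, as a stand-in. The toy weight matrix `W`, the helper names, and the use of edge weights in place of PathSim's unweighted path counts are all assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*m)]


def commuting_matrix(doc_entity):
    """Commuting matrix of the symmetric meta-path Document-Entity-Document.

    Entry (i, j) sums, over all shared entities, the product of the edge
    weights along the path from document i to document j.
    """
    return matmul(doc_entity, transpose(doc_entity))


def pathsim(m, i, j):
    """Classic PathSim score computed from a commuting matrix m."""
    denom = m[i][i] + m[j][j]
    return 2.0 * m[i][j] / denom if denom else 0.0


# Hypothetical weighted document-entity adjacency: 3 documents x 3 entities.
W = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
]

# Computed once and reused for every pairwise query.
M = commuting_matrix(W)

print(pathsim(M, 0, 1))  # documents 0 and 1 share two entities
print(pathsim(M, 0, 2))  # documents 0 and 2 share none
```

Storing `M` once per meta-path means each pairwise score afterwards needs only three lookups, which is one plausible reading of why the abstract computes the commuting matrix on the pruned local graph rather than on the whole network.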

Key words

similarity measurement / weighted heterogeneous information network / meta-path / text mining

Cite this article

MA Qiuwei, ZHAO Shuliang, ZHAO Yan. A Text Similarity Measure Based on Heterogeneous Information Network. Journal of Chinese Information Processing. 2023, 37(9): 108-120


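Funding

National Social Science Fund of China (13&ZD091, 18ZDA200); Hebei Province Key Research and Development Program (20370301D); Hebei Normal University Major Key Technology Research Project (L2020K01)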