Cross-border ethnic text clustering aims to establish associations among texts about different cross-border ethnic groups, supporting cross-border ethnic text retrieval and event correlation analysis. However, cultural expression varies widely across cross-border ethnic groups, and cultural background is often missing from the texts, which makes clustering difficult. This paper therefore proposes a cross-border ethnic text clustering method that fuses a domain knowledge graph. First, the cross-border ethnic domain knowledge graph is incorporated to supplement cultural background knowledge and to link entity semantics in the texts, yielding enhanced local semantics. Second, given the importance of global semantic information in cross-border ethnic texts, a heterogeneous graph attention network extracts global features among texts, topics, and domain keywords. Finally, a variational autoencoder network fuses the local and global information, and the learned latent representation is used for clustering. Experiments show that the proposed method improves Acc by 11.4%, NMI by 1%, and ARI by 9.4% over the baseline method.
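The global-feature step described above can be sketched as one attention-aggregation pass over a text node's typed neighbours (topics and domain keywords). This is a minimal illustrative sketch under assumed dimensions and weight names, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hetero_attention(h_center, neighbors_by_type, W_by_type, a_by_type):
    """One GAT-style aggregation over a heterogeneous neighbourhood:
    neighbours of each type get a type-specific projection, an attention
    score against the centre text node, and a softmax-weighted sum;
    type-level results are then averaged into one global feature."""
    per_type = []
    for ntype, H in neighbors_by_type.items():
        W, a = W_by_type[ntype], a_by_type[ntype]
        z_c = W @ h_center                      # projected centre node
        Z = H @ W.T                             # projected neighbours (rows)
        pairs = np.concatenate([np.tile(z_c, (len(Z), 1)), Z], axis=1)
        alpha = softmax(np.tanh(pairs) @ a)     # attention over neighbours
        per_type.append(alpha @ Z)
    return np.tanh(np.mean(per_type, axis=0))   # fused global feature

# Toy graph: one text node with 2 topic and 5 keyword neighbours.
d_in, d_out = 8, 4
h_text = rng.standard_normal(d_in)
neigh = {"topic": rng.standard_normal((2, d_in)),
         "keyword": rng.standard_normal((5, d_in))}
W = {t: rng.standard_normal((d_out, d_in)) for t in neigh}
a = {t: rng.standard_normal(2 * d_out) for t in neigh}

g = hetero_attention(h_text, neigh, W, a)       # global feature, shape (4,)
```

In the paper this feature would then be fused with the knowledge-graph-enhanced local semantics through the variational autoencoder before clustering.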
Abstract
The task of cross-border ethnic text clustering aims to establish correlations between texts of different cross-border ethnic groups, a task complicated by substantial differences in cultural expression among these groups and by missing cultural background. This paper proposes a cross-border ethnic text clustering method that fuses a domain knowledge graph. For local semantic information, the method uses the cross-border ethnic domain knowledge graph to supply cultural background knowledge and to link entities in the texts. For global semantic information, it applies a heterogeneous graph attention network to extract features across texts, topics, and domain keywords. A variational autoencoder network finally fuses the local and global information, and the learned latent representation is used for clustering. Experiments show that the proposed method improves Acc by 11.4%, NMI by 1%, and ARI by 9.4% compared with the baseline method.
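The three reported metrics follow standard definitions; the sketch below shows how they are typically computed (Acc via Hungarian matching of cluster labels, NMI and ARI via scikit-learn). It is a generic illustration, not code from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Acc: best one-to-one mapping between predicted and true cluster
    labels, found with the Hungarian algorithm on the confusion counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    d = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((d, d), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1                            # count (pred, true) pairs
    row, col = linear_sum_assignment(-w)        # maximise matched count
    return w[row, col].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                     # same partition, permuted labels

acc = clustering_accuracy(y_true, y_pred)       # 1.0 after label matching
nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
```

Because a clustering is only defined up to a permutation of label names, the label-matching step is what makes Acc meaningful for unsupervised methods.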
Key words
cross-border ethnicity /
knowledge graph /
text clustering /
heterogeneous graph attention network
Funding
National Natural Science Foundation of China (62166023, 61866019); Natural Science Foundation of Yunnan Province (2019FA023); Yunnan Provincial Major Science and Technology Special Plan Projects (202103AA080015, 202002AD080001)