Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (9): 58-65.
Minority, Cross-Border and Neighboring Language Information Processing
Research on Tibetan Short Text Classification Based on GraphSAGE Network

  • JING Rong1, YANG Yimin1, WAN Fucheng2, GUO Qi1,3, YU Hongzhi2, MA Ning2

Abstract

Text classification is an important research direction in natural language processing, yet progress on Tibetan text classification has been slow, owing to the scarcity of Tibetan data, the complexity of linguistic feature extraction, and the diversity of discourse structures. This paper therefore takes a graph neural network as the base model and improves it. First, on top of "syllable-syllable" and "syllable-document" modeling, document features are fused and a binary classification model is used to dynamically construct "document-document" edges, fully mining the global features of short texts; a sliding window is introduced to reduce the model's computational complexity, and the optimal window size is searched for. Second, to address the syllable sparsity of Tibetan short texts, GraphSAGE is introduced as the base model for the first time, and the performance differences of different aggregation functions on Tibetan short text classification are explored. Finally, to capture the heterogeneity of the relationships between nodes, neighbor node features are weighted and then average-pooled to strengthen the model's feature extraction ability. On the TNCC title text dataset, the proposed model reaches a classification accuracy of 62.50%, outperforming the traditional GCN, the original GraphSAGE, and the pre-trained language model CINO by 2.56%, 1%, and 2.4%, respectively.
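As a concrete illustration of the graph-construction step, the sketch below builds "syllable-syllable" edges weighted by sliding-window PMI and "syllable-document" edges weighted by TF-IDF, the conventional weighting for this kind of text graph. The function name `build_text_graph` and the exact weighting formulas are illustrative assumptions rather than the paper's procedure; in particular, the dynamically constructed "document-document" edges are omitted here.

```python
from collections import Counter
from itertools import combinations
from math import log

def build_text_graph(docs, window=5):
    """Sketch of a TextGCN-style graph over syllable and document nodes.

    docs: list of documents, each a list of syllable strings.
    Returns a dict mapping (node_u, node_v) -> edge weight, where
    syllable-syllable weights use positive PMI over sliding windows and
    syllable-document weights use TF-IDF.
    """
    # Count sliding-window occurrences and co-occurrences of syllables.
    win_count = Counter()    # syllable -> number of windows containing it
    pair_count = Counter()   # (syl_a, syl_b) -> number of shared windows
    n_windows = 0
    for doc in docs:
        # Short documents yield a single window covering the whole text.
        spans = [doc[i:i + window] for i in range(max(1, len(doc) - window + 1))]
        for span in spans:
            n_windows += 1
            uniq = set(span)
            win_count.update(uniq)
            for a, b in combinations(sorted(uniq), 2):
                pair_count[(a, b)] += 1

    edges = {}
    # Syllable-syllable edges: keep only positive PMI.
    for (a, b), n_ab in pair_count.items():
        pmi = log(n_ab * n_windows / (win_count[a] * win_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi

    # Syllable-document edges: TF-IDF of the syllable in the document.
    n_docs = len(docs)
    df = Counter(s for doc in docs for s in set(doc))
    for d, doc in enumerate(docs):
        for s, n in Counter(doc).items():
            edges[(s, f"doc{d}")] = (n / len(doc)) * log(n_docs / df[s])
    return edges
```

A larger sliding window captures more co-occurrence pairs but grows the edge set; the paper searches for the window size that balances accuracy against computational cost.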

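The aggregation refinement, weighting neighbor features before average pooling inside a GraphSAGE layer, can be sketched as follows. This NumPy layer, its `att` per-feature weight vector, and the ReLU + L2-normalization details are a minimal sketch of the idea, not the authors' implementation.

```python
import numpy as np

def sage_layer(h, neighbors, w_self, w_neigh, att):
    """One GraphSAGE layer with feature-weighted mean aggregation.

    h:               (N, d) node feature matrix
    neighbors:       list of neighbor-index lists, one per node
    w_self, w_neigh: (d, d_out) weight matrices
    att:             (d,) per-feature weights applied to neighbor features
                     before average pooling (the weighting-then-pooling step)
    """
    out = np.empty((h.shape[0], w_self.shape[1]))
    for v, nbrs in enumerate(neighbors):
        if nbrs:
            agg = (h[nbrs] * att).mean(axis=0)  # weight, then average-pool
        else:
            agg = np.zeros(h.shape[1])          # isolated node: no neighbors
        z = h[v] @ w_self + agg @ w_neigh
        out[v] = np.maximum(z, 0.0)             # ReLU
    # L2-normalize node embeddings, as in the original GraphSAGE.
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 5))
w_self = rng.normal(size=(5, 4))
w_neigh = rng.normal(size=(5, 4))
out = sage_layer(h, [[1, 2], [0, 2], [0, 1]], w_self, w_neigh, np.ones(5))
```

With `att` set to all ones this reduces to the standard GraphSAGE mean aggregator; learning `att` lets the model discount uninformative neighbor features, which is how the weighting step captures heterogeneous node relations.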
Keywords

graph neural network / Tibetan text classification / TNCC dataset

Cite This Article
JING Rong, YANG Yimin, WAN Fucheng, GUO Qi, YU Hongzhi, MA Ning. Research on Tibetan Short Text Classification Based on GraphSAGE Network. Journal of Chinese Information Processing. 2024, 38(9): 58-65

References

[1] XU Guixian, ZHANG Zixin, YU Shaona, et al. Tibetan news text classification based on graph convolutional networks [J]. Data Analysis and Knowledge Discovery, 2023, 7(6): 73-85. (in Chinese)
[2] LIU S, DENG J, SUN Y, et al. TiBERT: Tibetan pre-trained language model [C]//Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 2022: 2956-2961.
[3] QUN N, LI X, QIU X, et al. End-to-end neural text classification for Tibetan [M]//Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Cham: Springer, 2017: 472-480.
[4] HAMILTON W, YING Z, LESKOVEC J. Inductive representation learning on large graphs [C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 1025-1035.
[5] LI Ailin, LI Zhaoyao. Tibetan text classification based on naive Bayes [J]. Journal of Chinese Information Processing, 2013(11): 11-12. (in Chinese)
[6] JIA Hongyun, QUN Nuo, SU Huijing, et al. Research and implementation of SVM-based Tibetan text classification [J]. Electronic Technology & Software Engineering, 2018(9): 144-146. (in Chinese)
[7] ZHOU Deng. Research on Tibetan text classification based on the n-gram model [D]. Lanzhou: Northwest Minzu University, 2010. (in Chinese)
[8] JIA Huiqiang. Research on key technologies of KNN-based Tibetan text classification [J]. Journal of Northwest Minzu University (Natural Science Edition), 2011, 32(3): 24-29. (in Chinese)
[9] KIM Y. Convolutional neural networks for sentence classification [J]. arXiv preprint arXiv:1408.5882, 2014.
[10] SU Huijing. Research and implementation of Tibetan text classification based on MLP and SepCNN models [D]. Lhasa: Tibet University, 2021. (in Chinese)
[11] ZHANG Jiangyan. Research and application of BERT-based Tibetan pre-trained language models [D]. Lhasa: Tibet University, 2023. (in Chinese)
[12] KONG Chunwei, LYU Xueqiang, ZHANG Le. HRTNSC: A hybrid representation-based model for subjective and objective sentence classification of Tibetan news [J]. Journal of Chinese Information Processing, 2022, 36(12): 94-103. (in Chinese)
[13] XU Guixian, LIU Lanyin, ZHANG Ting, et al. Tibetan text classification based on pre-trained models and graph neural networks [J]. Journal of Northeast Normal University (Natural Science Edition), 2023, 55(1): 52-64. (in Chinese)
[14] SCARSELLI F, GORI M, TSOI A C, et al. The graph neural network model [J]. IEEE Transactions on Neural Networks, 2008, 20(1): 61-80.
[15] KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks [J]. arXiv preprint arXiv:1609.02907, 2016.
[16] YAO L, MAO C, LUO Y. Graph convolutional networks for text classification [C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 7370-7377.
[17] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [J]. arXiv preprint arXiv:1810.04805, 2018.
[18] AN Bo, LONG Congjun. Tibetan text classification based on pre-trained language models [J]. Journal of Chinese Information Processing, 2022, 36(12): 85-93. (in Chinese)
[19] YANG Z, XU Z, CUI Y, et al. CINO: A Chinese minority pre-trained language model [J]. arXiv preprint arXiv:2202.13558, 2022.

Funding

National Natural Science Foundation of China (62366046)