Abstract
To address the scarcity of Tibetan-language datasets, this paper augments the TNCC dataset and proposes MFMLP, a Tibetan text classification model based on multi-feature fusion and multilingual pre-training, built from the Chinese minority-language pre-trained model (CINO), TextCNN, and a bidirectional long short-term memory network (BiLSTM). The tokenized text is fed into CINO, and the full sequence of extracted token features is then passed through parallel TextCNN and BiLSTM channels to capture features at different levels. These features are fused with the [CLS] feature extracted by CINO in a fusion layer, and a classifier produces the final label. Text classification experiments on the Tibetan TNCC dataset show that, compared with the CINO model, the proposed approach achieves a measurable improvement in recognizing Tibetan text categories.
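To make the described pipeline concrete, the sketch below shows one plausible PyTorch realization. It is a minimal sketch, not the authors' implementation: the hidden size, kernel sizes, filter count, LSTM width, and the Hugging Face checkpoint name "hfl/cino-base-v2" are illustrative assumptions.

```python
# Minimal sketch of the MFMLP architecture described in the abstract.
# Assumptions (not from the paper): hidden size 768, kernel sizes {2,3,4},
# 128 conv filters, a single-layer BiLSTM, and the checkpoint "hfl/cino-base-v2".
import torch
import torch.nn as nn
from transformers import XLMRobertaModel


class MFMLP(nn.Module):
    def __init__(self, num_classes, hidden=768, n_filters=128,
                 kernel_sizes=(2, 3, 4), lstm_hidden=256):
        super().__init__()
        # CINO is an XLM-R-style encoder pre-trained on Chinese minority languages.
        self.encoder = XLMRobertaModel.from_pretrained("hfl/cino-base-v2")
        # TextCNN channel: 1-D convolutions over the token dimension.
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes]
        )
        # BiLSTM channel over the same token representations.
        self.bilstm = nn.LSTM(hidden, lstm_hidden, batch_first=True,
                              bidirectional=True)
        fused_dim = hidden + n_filters * len(kernel_sizes) + 2 * lstm_hidden
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        tokens = out.last_hidden_state          # (B, T, H): all token features
        cls = tokens[:, 0]                      # (B, H): the [CLS] feature
        # TextCNN path: convolution + max-pooling over time per kernel size.
        c = tokens.transpose(1, 2)              # (B, H, T) layout for Conv1d
        cnn_feats = torch.cat(
            [torch.relu(conv(c)).max(dim=2).values for conv in self.convs],
            dim=1,
        )
        # BiLSTM path: concatenate the final forward and backward hidden states.
        _, (h, _) = self.bilstm(tokens)
        lstm_feats = torch.cat([h[-2], h[-1]], dim=1)
        # Fusion layer: concatenate [CLS], CNN, and BiLSTM features, then classify.
        fused = torch.cat([cls, cnn_feats, lstm_feats], dim=1)
        return self.classifier(fused)
```

Under these assumptions the fused vector has 768 + 3 × 128 + 2 × 256 = 1664 dimensions. In a full training setup the encoder would presumably be fine-tuned jointly with both channels, with a standard cross-entropy loss over the TNCC news categories.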
Key words
multi-feature fusion /
multi-language pre-training /
Tibetan text classification
Funding
National Social Science Fund of China (19BGL241)