Pre-trained Language Model Based Tibetan Text Classification

AN Bo, LONG Congjun

Journal of Chinese Information Processing ›› 2022, Vol. 36 ›› Issue (12): 85-93.
Ethnic Minority, Cross-Border, and Neighboring Language Information Processing


Abstract

Tibetan text classification is a fundamental task in Tibetan natural language processing. The current mainstream approach to text classification is fine-tuning a large-scale pre-trained language model. However, Tibetan lacks open-source large-scale corpora and pre-trained language models, so this approach has not been validated on Tibetan text classification. To address this problem, this paper crawls a relatively large-scale Tibetan text dataset and trains a Tibetan pre-trained language model (BERT-base-Tibetan) on it. Experiments applying this model to a range of neural-network-based text classification models show that the pre-trained language model significantly improves Tibetan text classification performance (an average gain of 9.3% in F1 score), verifying the value of pre-trained language models for Tibetan text classification.
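
The pre-train-then-fine-tune recipe described in the abstract follows the standard BERT workflow. As a rough illustration only, the sketch below fine-tunes a BERT-style Tibetan checkpoint for sequence classification with the Hugging Face transformers library; the checkpoint path, label count, and toy inputs are placeholders, not artifacts released with the paper.

```python
# Minimal sketch of the fine-tuning stage described in the abstract, using
# the Hugging Face transformers API. "./bert-base-tibetan" is a hypothetical
# local checkpoint path and the two-sample dataset is a stand-in; the paper's
# experiments use a crawled Tibetan corpus and several neural classifiers.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

MODEL_DIR = "./bert-base-tibetan"  # placeholder: not a published model ID
NUM_LABELS = 2                     # placeholder: set to the real class count

tokenizer = BertTokenizerFast.from_pretrained(MODEL_DIR)
model = BertForSequenceClassification.from_pretrained(
    MODEL_DIR, num_labels=NUM_LABELS)

texts = ["<Tibetan news text, class 0>", "<Tibetan news text, class 1>"]
labels = torch.tensor([0, 1])

# Tokenize into fixed-length tensors; BERT-base caps inputs at 512 tokens.
enc = tokenizer(texts, padding=True, truncation=True,
                max_length=128, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few demo steps; real fine-tuning iterates over epochs
    loss = model(**enc, labels=labels).loss  # cross-entropy over the classes
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: argmax over the classification-head logits gives the label.
model.eval()
with torch.no_grad():
    preds = model(**enc).logits.argmax(dim=-1)
print(preds.tolist())
```

The pre-training stage itself would use the standard masked-language-modeling objective over the crawled corpus (e.g., via transformers' DataCollatorForLanguageModeling); only the classification head fine-tuned above is task-specific.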

Key words

Tibetan text classification / pre-trained language model / deep learning

Cite this article

AN Bo, LONG Congjun. Pre-trained Language Model Based Tibetan Text Classification. Journal of Chinese Information Processing, 2022, 36(12): 85-93.


Funding

National Natural Science Foundation of China (62076233); Young Scholars Program of the 2022 Innovation Project of the Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences (2022MZSQN001); National Social Science Fund of China Special Project on Neglected and Endangered Disciplines (20VJXG036); National Social Science Fund of China (22BTQ010)