Abstract
Text classification is one of the fundamental tasks in natural language processing. The scarcity of labeled data has long been a major obstacle to the development of natural language processing technologies for Tibetan and other minority languages, since traditional deep learning models demand labeled data at a scale these languages lack. To address this problem, this paper applies prompt learning on top of large-scale pre-trained language models to achieve low-resource Tibetan text classification, conducting experiments with different Tibetan pre-trained language models and prompt templates. The results show that, with well-designed prompt templates and related techniques, prompt learning can improve Tibetan text classification performance (48.3%) even when training data are insufficient, preliminarily demonstrating the value and potential of prompt learning for minority language processing. However, the results also reveal that the prompt learning model performs poorly on certain categories, and that the Tibetan pre-trained language models themselves leave room for further improvement.
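To make the approach concrete, the cloze-style mechanics behind prompt learning for classification can be sketched as follows. This is a minimal illustration only: the template string, the English label words, and the `classify` helper are all hypothetical stand-ins, not the paper's actual Tibetan templates or verbalizers, and the masked-LM scores are faked rather than produced by a real Tibetan pre-trained model.

```python
# Sketch of prompt-based classification: a cloze template wraps the input
# around a mask slot, and a verbalizer maps the label words a masked LM
# predicts at that slot onto task classes. All names here are illustrative.

MASK = "[MASK]"

def apply_template(text: str, template: str) -> str:
    """Fill the input text into a cloze-style prompt template."""
    return template.format(text=text, mask=MASK)

# Verbalizer: label word -> class name (hypothetical English stand-ins
# for the Tibetan label words a real system would use).
VERBALIZER = {"sports": "Sports", "politics": "Politics", "art": "Arts"}

def classify(mask_word_scores: dict) -> str:
    """Pick the class whose label word received the highest score at the
    mask position. `mask_word_scores` stands in for the LM's fill-mask
    distribution over the vocabulary."""
    best = max(VERBALIZER, key=lambda w: mask_word_scores.get(w, 0.0))
    return VERBALIZER[best]

template = "News: {text} Topic: {mask}."
prompt = apply_template("The team won the final match.", template)
# A real system would feed `prompt` to a Tibetan pre-trained masked LM;
# here the LM's scores for the mask position are faked for illustration.
fake_scores = {"sports": 0.81, "politics": 0.07, "art": 0.02}
label = classify(fake_scores)
```

Because the pre-trained model itself is untouched and only the template and verbalizer are designed, this setup can be evaluated with very few labeled examples, which is what makes it attractive in the low-resource Tibetan setting the paper studies.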
Key words
Tibetan text classification /
pre-trained language model /
prompt learning /
few-shot learning
Funding
National Social Science Fund of China (22BTQ010); Independent Project of the State Key Laboratory of Tibetan Intelligent Information Processing and Application (2022-SKL-012); National Natural Science Foundation of China (62076233, 62266036); Chinese Academy of Social Sciences Special Data Project (2024SJK017)