Abstract
This paper proposes Lacmia (Language-Anti-confusioned Chinese Minority Abstractive Summarization Model), an abstractive summarization model for China's low-resource ethnic minority languages. To overcome the limitation of earlier models that handle only a single language, Lacmia adopts a unified generative architecture that performs summarization across different ethnic languages. To address the insufficient performance of previous models on low-resource minority languages, a language-information embedding module is integrated into the framework. In addition, a target-language lexical-preference regularization term is introduced into the loss function, which effectively mitigates the language confusion that arises in multilingual summarization and thereby improves the accuracy and fluency of the generated summaries. Extensive experiments show that Lacmia achieves outstanding results on summarization for low-resource minority languages including Tibetan and Uyghur: besides significant gains under the ROUGE metrics, it obtains the best scores on the newly proposed CINOScore and NLCR metrics, confirming the model's effectiveness.
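The paper's implementation details are not reproduced here; as a rough illustration only, the sketch below shows, in PyTorch, one way a language-information embedding module and a target-language lexical-preference regularization term could be combined with the standard cross-entropy training objective. Every name in it (LanguageAwareEmbedding, lexical_preference_loss, lambda_reg, the language IDs and the vocabulary mask) is a hypothetical placeholder, not Lacmia's actual code.

```python
# Minimal sketch (not the authors' released code) of the two mechanisms the
# abstract describes: a language-information embedding added to the model
# input, and a target-language lexical-preference regularizer added to the
# usual cross-entropy loss. Names, shapes, and the exact form of the
# regularizer are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageAwareEmbedding(nn.Module):
    """Token embedding plus a learned embedding of the target-language ID."""

    def __init__(self, vocab_size: int, num_languages: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(num_languages, d_model)

    def forward(self, token_ids: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); lang_id: (batch,)
        # The language embedding is broadcast over every position, so the
        # model is conditioned throughout on which language it should produce.
        return self.token_emb(token_ids) + self.lang_emb(lang_id).unsqueeze(1)


def lexical_preference_loss(logits: torch.Tensor,
                            target_ids: torch.Tensor,
                            target_lang_vocab_mask: torch.Tensor,
                            lambda_reg: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus a penalty on probability mass assigned to tokens
    outside the target language's vocabulary (one plausible reading of a
    'target-language lexical preference' regularizer)."""
    # logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)
    # target_lang_vocab_mask: (vocab,) boolean, True for tokens belonging
    # to the target language (plus shared punctuation and digits).
    ce = F.cross_entropy(logits.transpose(1, 2), target_ids, ignore_index=-100)
    probs = logits.softmax(dim=-1)
    off_language_mass = probs[..., ~target_lang_vocab_mask].sum(dim=-1).mean()
    return ce + lambda_reg * off_language_mass


if __name__ == "__main__":
    emb = LanguageAwareEmbedding(vocab_size=32000, num_languages=4, d_model=256)
    tokens = torch.randint(0, 32000, (2, 16))
    lang = torch.tensor([0, 2])        # illustrative language IDs only
    print(emb(tokens, lang).shape)     # torch.Size([2, 16, 256])

    logits = torch.randn(2, 16, 32000)
    mask = torch.zeros(32000, dtype=torch.bool)
    mask[:8000] = True                 # pretend the first 8k tokens are target-language tokens
    print(lexical_preference_loss(logits, tokens, mask).item())
```

The point the sketch tries to make is that the language signal enters twice under this reading: on the input side as an embedding that conditions generation, and on the output side as a penalty that discourages probability mass on tokens outside the target language's vocabulary.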
Keywords
abstractive text summarization / multilingual pre-training model / low-resource language processing / multi-objective learning
Funding
National Key Research and Development Program of China (2020YFB1406702-3); Beijing Municipal Science and Technology Program (Z231100001723002); National Natural Science Foundation of China (62006257)