Lacmia: Language-Anti-confusioned Chinese Minority Abstractive Summarization Model

WENG Yu, LUO Haoyu, LIU Zheng, Chaomurilige, LIU Xuan, DONG Jun

Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (10): 80-94.
Ethnic Minority, Cross-Border and Neighboring Language Information Processing


Abstract

This paper proposes Lacmia (Language-Anti-confusioned Chinese Minority Abstractive Summarization Model), an abstractive summarization model for China's low-resource ethnic minority languages. To overcome the limitation of previous models that could handle only a single language, Lacmia adopts a unified generative architecture to perform summarization across different ethnic languages. To address the insufficient performance of earlier models on low-resource minority languages, a language information embedding module is integrated into the framework. In addition, a target-language preference regularization term introduced into the loss function effectively alleviates the language confusion that arises in multilingual summarization, improving the accuracy and fluency of the generated summaries. Extensive experiments show that Lacmia achieves excellent results on summarization for low-resource ethnic minority languages including Tibetan and Uyghur: besides significant gains under the ROUGE metrics, it obtains the best scores on CINOScore and NLCR, two metrics newly proposed in this paper, confirming the effectiveness and advancement of the model.
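
The paper itself provides no code here; as a minimal sketch of how a target-language preference regularization term of the kind described could be combined with an ordinary cross-entropy summarization loss, assuming a PyTorch-style decoder, a hypothetical boolean vocabulary mask target_lang_mask marking tokens of the target language, and an assumed weight lambda_lang (none of these names come from the paper), one might write:

import torch
import torch.nn.functional as F

def summarization_loss(logits, labels, target_lang_mask, lambda_lang=0.1):
    # logits:           (batch, seq_len, vocab_size) decoder outputs
    # labels:           (batch, seq_len) gold summary token ids, -100 = padding
    # target_lang_mask: (vocab_size,) bool tensor, True for tokens that belong
    #                   to the target language's script/vocabulary (assumption)
    # lambda_lang:      weight of the regularization term (assumed value)
    vocab_size = logits.size(-1)

    # Standard token-level cross entropy over the gold summary.
    ce = F.cross_entropy(
        logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
    )

    # Probability mass the decoder assigns to tokens *outside* the target
    # language at each step; penalizing it discourages language confusion
    # (e.g. emitting Chinese tokens in a Tibetan summary).
    probs = logits.softmax(dim=-1)
    off_target_mass = probs[..., ~target_lang_mask].sum(dim=-1)

    valid = (labels != -100).float()
    lang_penalty = (off_target_mass * valid).sum() / valid.sum().clamp(min=1.0)

    return ce + lambda_lang * lang_penalty

The penalty simply measures how much probability mass falls on out-of-language tokens at each decoding step; Lacmia's actual formulation of the regularization term and of the language information embedding may differ.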

Keywords

abstractive text summarization / multilingual pre-training model / low-resource language processing / multi-objective learning

Cite This Article

WENG Yu, LUO Haoyu, LIU Zheng, Chaomurilige, LIU Xuan, DONG Jun. Lacmia: Language-Anti-confusioned Chinese Minority Abstractive Summarization Model. Journal of Chinese Information Processing. 2024, 38(10): 80-94


Funding

National Key Research and Development Program of China (2020YFB1406702-3); Beijing Municipal Science and Technology Program (Z231100001723002); National Natural Science Foundation of China (62006257)