Journal of Chinese Information Processing ›› 2022, Vol. 36 ›› Issue (3): 45-53, 63.
Information Processing of Minority, Cross-border and Neighboring Languages

Evaluation Method of Text Generation Based on Multi-granularity Features

LAI Hua1,2, GAO Yumeng1,2, HUANG Yuxin1,2, YU Zhengtao1,2, ZHANG Yongbing1,2

Abstract

In recent years, evaluation methods for text generation based on pre-trained language models have attracted wide attention. These methods assess the quality of generated text by computing sub-word-level similarity between two sentences. However, for languages with many adhesive morphemes, such as Vietnamese and Thai, a single syllable or sub-word does not form an independent, meaningful word, so matching at the sub-word level alone cannot fully capture the semantic similarity between two sentences. This paper therefore proposes a text generation evaluation method based on multi-granularity features covering sub-words, syllables, and phrases. Text representations are first obtained with MBERT, and similarities between coarse-grained semantic units such as syllables and phrases are then introduced to strengthen the sub-word-level similarity model. Experimental results on machine translation, cross-lingual summarization, and cross-lingual data filtering show that the proposed multi-granularity metric outperforms statistical metrics such as ROUGE and BLEU as well as semantic-similarity metrics such as BERTScore, and correlates better with human judgments.
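To make the matching procedure concrete, the sketch below shows one way a multi-granularity score of this kind could be computed: MBERT sub-word embeddings are matched greedily between candidate and reference (as in BERTScore), the same embeddings are mean-pooled into coarser units, and the two similarity scores are combined. The bert-base-multilingual-cased checkpoint, the equal weights, and the use of the tokenizer's whitespace word boundaries as the coarse-grained grouping are illustrative assumptions, not the authors' exact formulation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# MBERT checkpoint; any multilingual encoder with a fast tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()


def encode(sentence: str):
    """Return normalised MBERT sub-word embeddings and the word index of each sub-word."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, hidden)
    word_ids = enc.word_ids()                               # None for [CLS]/[SEP]
    keep = [i for i, w in enumerate(word_ids) if w is not None]
    emb = torch.nn.functional.normalize(hidden[keep], dim=-1)
    return emb, [word_ids[i] for i in keep]


def pool(emb: torch.Tensor, ids: list) -> torch.Tensor:
    """Mean-pool sub-words that share a group index into one coarse-grained vector."""
    groups = {}
    for vec, gid in zip(emb, ids):
        groups.setdefault(gid, []).append(vec)
    pooled = torch.stack([torch.stack(vs).mean(dim=0) for _, vs in sorted(groups.items())])
    return torch.nn.functional.normalize(pooled, dim=-1)


def greedy_f1(cand: torch.Tensor, ref: torch.Tensor) -> float:
    """BERTScore-style greedy-matching F1 between two sets of normalised embeddings."""
    sim = cand @ ref.T                                       # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()                 # best match for each candidate unit
    recall = sim.max(dim=0).values.mean()                    # best match for each reference unit
    return float(2 * precision * recall / (precision + recall + 1e-8))


def multi_granularity_score(candidate: str, reference: str,
                            w_fine: float = 0.5, w_coarse: float = 0.5) -> float:
    """Combine sub-word matching with matching over pooled coarse-grained units."""
    c_emb, c_ids = encode(candidate)
    r_emb, r_ids = encode(reference)
    fine = greedy_f1(c_emb, r_emb)                              # sub-word granularity
    coarse = greedy_f1(pool(c_emb, c_ids), pool(r_emb, r_ids))  # pooled coarse granularity
    return w_fine * fine + w_coarse * coarse


print(multi_granularity_score("con mèo ngồi trên thảm", "con mèo nằm trên tấm thảm"))
```

In practice a language-specific segmenter for Thai or Vietnamese syllables and phrases would supply the grouping instead of whitespace word boundaries, and the weighting between granularities would be tuned rather than fixed.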

Key words

text generation / evaluation method / adhesive morphemes / multi-granularity feature / MBERT

Cite this article

LAI Hua, GAO Yumeng, HUANG Yuxin, YU Zhengtao, ZHANG Yongbing. Evaluation Method of Text Generation Based on Multi-granularity Features. Journal of Chinese Information Processing. 2022, 36(3): 45-53,63


Funding

National Natural Science Foundation of China (61732005, 61972186, 61762056, 61761026); Yunnan Provincial Major Science and Technology Special Program (202002AD080001-5, 202103AA080015); Yunnan High-tech Industry Development Project (201606); Yunnan Fundamental Research Program (202001AT070047, 2018FB104)