基于神经网络的机器译文自动评价综述

中文信息学报 ›› 2023, Vol. 37 ›› Issue (9) : 1-14.
综述


  • 刘媛1,2,李茂西1,3,罗琪4,李易函5

Automatic Evaluation of Machine Translation Based on Neural Network: A Survey

  • LIU Yuan1,2, LI Maoxi1,3, LUO Qi4, LI Yihan5

摘要

机器译文自动评价是指对机器翻译系统输出译文的质量进行自动评价,是机器翻译领域的一项重要研究任务。目前机器译文自动评价方法的研究主流为基于神经网络的机器译文自动评价,该文对其进行综述,将其分为基于表征匹配的方法和基于端到端神经网络的方法,梳理和对比了这两类自动评价方法的代表性工作及其特点,并介绍推动机器译文自动评价研究的相关评测活动和性能评价指标,最后展望基于神经网络的机器译文自动评价的发展趋势,并对全文进行总结。

Abstract

Automatic evaluation of machine translation, i.e., the automatic assessment of the quality of translations output by machine translation systems, is an important research task in the field of machine translation. This paper surveys neural-network-based automatic evaluation methods, classifying them into representation-matching methods and end-to-end neural network methods. We review and compare representative works of the two categories and their characteristics, and introduce the evaluation campaigns and meta-evaluation criteria that have driven research on automatic evaluation. Finally, we discuss future trends of neural-network-based automatic evaluation and conclude the paper.
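The representation-matching family mentioned in the abstract can be illustrated with a minimal sketch. Following the greedy-matching idea popularized by BERTScore-style metrics, each candidate token embedding is aligned to its most similar reference token embedding (precision direction) and vice versa (recall direction), and the two are combined into an F1 score. The 2-dimensional vectors used below are hypothetical stand-ins for real contextual embeddings; this is a toy illustration of the general technique, not the implementation of any specific cited system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_match_f1(cand_vecs, ref_vecs):
    """Greedy representation matching: align each candidate token to its
    most similar reference token (precision) and each reference token to
    its most similar candidate token (recall); combine into F1."""
    precision = sum(max(cosine(c, r) for r in ref_vecs)
                    for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical token embeddings for a candidate and a reference sentence.
candidate = [[1.0, 0.0], [0.6, 0.8]]
reference = [[1.0, 0.0], [0.0, 1.0]]
score = greedy_match_f1(candidate, reference)  # → 0.9
```

In real systems the token vectors come from a pretrained encoder (e.g., contextual embeddings), and similarities may additionally be weighted by token importance; the matching scheme itself is the same.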

关键词

机器翻译 / 自动评价 / 神经网络 / 深度学习

Key words

machine translation / automatic evaluation / neural network / deep learning
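The performance criteria mentioned in the abstract are meta-evaluation measures: an automatic metric is judged by how well its scores correlate with human judgments, e.g., system-level Pearson correlation in the WMT metrics shared tasks. A minimal pure-Python sketch of the Pearson criterion, using hypothetical system-level scores:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between paired score lists (assumes at least
    two pairs and non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for four MT systems: the automatic metric's score
# and the corresponding human direct-assessment score per system.
metric_scores = [0.62, 0.58, 0.71, 0.49]
human_scores = [71.0, 68.5, 75.2, 60.1]
r = pearson(metric_scores, human_scores)  # closer to 1.0 = better agreement
```

Segment-level meta-evaluation typically uses rank correlations such as Kendall's tau instead, since human segment scores are noisy, but the principle of correlating metric output with human judgment is the same.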

引用本文

刘媛,李茂西,罗琪,李易函. 基于神经网络的机器译文自动评价综述. 中文信息学报. 2023, 37(9): 1-14
LIU Yuan, LI Maoxi, LUO Qi, LI Yihan. Automatic Evaluation of Machine Translation Based on Neural Network: A Survey. Journal of Chinese Information Processing. 2023, 37(9): 1-14


基金

国家自然科学基金(61662031,61462044);江西省教育厅科技项目(GJJ210306);教育部产学合作协同育人项目(220604647062739)