A Survey of Automatic Machine Translation Evaluation

LI Liangyou, GONG Zhengxian, ZHOU Guodong

Journal of Chinese Information Processing, 2014, Vol. 28, Issue 3: 81-91.
Machine Translation

Abstract

As machine translation has developed, automatic methods for evaluating its quality have received increasing attention. Numerous evaluation methods and techniques have emerged, and choosing a classification scheme for organizing and describing them is itself a major challenge. Based on differences in core technology, this paper focuses on three mainstream categories of automatic evaluation methods: methods based on linguistic checkpoints, string-matching methods, and machine-learning-based methods. For each category, the paper explains the working principles of representative methods and analyzes their respective strengths and weaknesses. In addition, although evaluation with limited reference translations is not a mainstream approach, its role in raising the degree of automation and improving evaluation performance cannot be ignored, so the paper treats it as a special category. The paper then reports the results of recent international campaigns that assess automatic evaluation metrics. Finally, it summarizes the development trends in automatic evaluation and the related problems that remain to be solved.

Key words

machine translation / automatic evaluation / classification of automatic evaluation

Cite this article

LI Liangyou, GONG Zhengxian, ZHOU Guodong. A Survey of Automatic Machine Translation Evaluation. Journal of Chinese Information Processing. 2014, 28(3): 81-91

Funding

National Natural Science Foundation of China (90920004)