Abstract
This paper presents a new annotation guideline for Chinese-English word alignment. Building on the existing Guidelines for Chinese-English Word Alignment (Linguistic Data Consortium, 2006), which it substantially revises and extends, the guideline introduces a new annotation scheme that divides word alignments into genuine links and pseudo links, and further divides genuine links into strong links and weak links. This finer-grained scheme better captures the characteristics of cross-lingual word alignment. The guideline has been applied in a large-scale manual Chinese-English word alignment annotation task, and we evaluated the consistency of the resulting annotation. Under the guideline, both intra-annotator and inter-annotator agreement were good: the two sets of Kappa coefficients for strong, weak, and pseudo links were 0.99, 0.98, 0.93 and 0.96, 0.83, 0.68, respectively. Finally, a simple experiment gives preliminary evidence that the guideline is useful for statistical machine translation (SMT).
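The Kappa coefficients quoted in the abstract measure agreement beyond chance. The abstract does not say exactly how the per-link-type values were computed, so the following is only a minimal sketch under stated assumptions: it uses Cohen's kappa for two annotation passes over the same candidate word pairs and obtains a per-type value by treating each link type as a one-vs-rest binary decision. The function names and the label strings (`"strong"`, `"weak"`, `"pseudo"`) are hypothetical and chosen for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both passes.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used one identical label throughout
    return (observed - expected) / (1 - expected)


def per_category_kappa(labels_a, labels_b, category):
    """One-vs-rest kappa for a single link type (an assumed scheme)."""
    bin_a = [label == category for label in labels_a]
    bin_b = [label == category for label in labels_b]
    return cohen_kappa(bin_a, bin_b)


# Hypothetical labels from two annotation passes over four candidate word pairs.
pass_1 = ["strong", "weak", "pseudo", "strong"]
pass_2 = ["strong", "weak", "weak", "strong"]
for link_type in ("strong", "weak", "pseudo"):
    print(link_type, per_category_kappa(pass_1, pass_2, link_type))
```

For intra-annotator agreement the two sequences would come from the same annotator at different times; for inter-annotator agreement they would come from different annotators.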
Key words: artificial intelligence; machine translation; annotation guidelines for Chinese-English word alignment; manual word alignment; genuine link; pseudo link; strong link; weak link; alignment and annotation agreement
References
[1] F.J. Och and H. Ney. A systematic comparison of various statistical alignment models [J]. Computational Linguistics, 2003, March, 29(1): 19-51.
[2] D. Melamed. Annotation style guide for the Blinker project, Version 1.0.4 [R]. IRCS Technical Report #98-06: University of Pennsylvania, Philadelphia, 1998.
[3] J. Véronis. ARCADE Tagging guidelines for word alignment, Version 1.0 [OL]. 1998. http://aune.lpl.univ-aix.fr/projects/arcade/2nd/word/guide/index.html.
[4] Linguistic Data Consortium. Guidelines for Chinese-English Word Alignment, Version 1.1 [OL]. 2006. http://projects.ldc.upenn.edu/gale/Alignment/specs/GALE_Chinese_alignment_guidelines_v1.1.pdf.
[5] Linguistic Data Consortium. Guidelines for Chinese-English Word Alignment, Version 3.0 [OL]. 2008. http://projects.ldc.upenn.edu/gale/Alignment/specs/GALE_Chinese_alignment_guidelines_v3.0.pdf.
[6] F.J. Och and H. Ney. Improved statistical alignment models [C]//Proc. of the 38th Annual Meeting of the ACL. Hong Kong, China, 2000: 440-447.
[7] J. Cohen. A coefficient of agreement for nominal scales [OL]. 1960. http://www.garfield.library.upenn.edu/classics1986/A1986AXF2600001.pdf.
[8] J. Carletta. Assessing agreement on classification tasks: the kappa statistic [OL]. 1996. http://acl.ldc.upenn.edu/J/J96/J96-2004.pdf.
[9] K. Krippendorff. Content Analysis: An Introduction to Its Methodology [M]. Beverly Hills: Sage Publications, 1980.
[10] P. Koehn et al. Moses: Open source toolkit for statistical machine translation [C]//Proceedings of the ACL Demo and Poster Sessions. 2007: 177-180.
[11] P. Koehn, F.J. Och and D. Marcu. Statistical phrase-based translation [C]//Proceedings of HLT/NAACL. 2003: 81-88.
[12] F.J. Och. Minimum Error Rate Training in Statistical Machine Translation [C]//Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. 2003: 160-167.
[13] K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation [C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Philadelphia, PA, 2002: 311-318.