Abstract
The morphological differences between Chinese and Mongolian and the small size of the available parallel corpora limit the performance of Chinese-Mongolian statistical machine translation (SMT). This paper introduces Mongolian morphological information into Chinese-Mongolian SMT. First, Mongolian words are segmented into morphemes, and mappings are built between Chinese words and Mongolian morphemes and between Mongolian morphemes and Mongolian words, which compensates for the morphological asymmetry between the two languages. Then, treating Mongolian morphemes as a pivot language, we train two SMT systems, Chinese-Morpheme and Morpheme-Mongolian, and derive a new phrase translation table and reordering model from them. Finally, this new translation knowledge is integrated into Chinese-Mongolian SMT through multiple decoding paths and multiple features. Experimental results show that adding the morpheme-pivot phrase translation table and reordering model to an existing SMT approach yields a clear improvement in BLEU over the baseline system and, to some extent, mitigates the effects of data sparsity and morphological differences on Chinese-Mongolian SMT. The method is general: by combining information at the morpheme and phrase levels, it brings the two languages into morphological symmetry, so it applies not only to Chinese-Mongolian SMT but also to other morphologically asymmetric, low-resource language pairs.
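One common way to derive a phrase table through a pivot is triangulation: marginalize over pivot phrases, p(target | source) = Σ p(target | pivot) · p(pivot | source). The abstract does not say whether the paper builds its table this way, so the sketch below is only an illustration of the general pivot idea under that assumption. The function name `triangulate`, the nested-dictionary table layout, and all phrases and probabilities are hypothetical placeholders, not data or code from the paper.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Compose two phrase tables through a shared pivot (assumed technique).

    src_to_pivot: {source phrase: {pivot phrase: p(pivot | source)}}
    pivot_to_tgt: {pivot phrase: {target phrase: p(target | pivot)}}
    Returns {source phrase: {target phrase: p(target | source)}}, where
    p(target | source) = sum over pivots of p(target | pivot) * p(pivot | source).
    """
    table = {}
    for src, pivots in src_to_pivot.items():
        scores = defaultdict(float)
        for pivot, p_pivot in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                scores[tgt] += p_pivot * p_tgt
        if scores:
            table[src] = dict(scores)
    return table


# Placeholder entries: one Chinese phrase, two morpheme segmentations,
# two Mongolian surface words. None of these are real corpus entries.
zh_to_morph = {"zh_phrase": {"stem +suf1": 0.75, "stem +suf2": 0.25}}
morph_to_mn = {
    "stem +suf1": {"mn_word1": 1.0},
    "stem +suf2": {"mn_word1": 0.5, "mn_word2": 0.5},
}

pivot_table = triangulate(zh_to_morph, morph_to_mn)
for tgt, prob in sorted(pivot_table["zh_phrase"].items()):
    print(f"zh_phrase -> {tgt}: {prob}")
```

A table built this way could then be offered to the decoder as an additional translation path alongside the directly trained Chinese-Mongolian table, which matches the abstract's description of multiple decoding paths only at a high level.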
Key words
pivot language /
morpheme /
statistical machine translation /
phrase translation table /
reordering model
Funding
National Natural Science Foundation of China (61502445, 61572462); Chinese Academy of Sciences Informatization Special Project (XXH12504-1-10)