Morphological Processing in Chinese-Mongolian Statistical Machine Translation

YANG Pan, ZHANG Jian, LI Miao, Wudabala, XUE Yan

Journal of Chinese Information Processing, 2009, Vol. 23, Issue 1: 50.
Review



Abstract

This paper introduces morphological processing into Chinese-Mongolian statistical machine translation in an attempt to resolve two problems in the translation output: incorrect word-form selection and disordered word order. We first describe corpus preparation: the original Chinese-Mongolian parallel corpus is morphologically analyzed and POS-tagged to obtain two base corpora, from which two derived corpora are generated for the morphological experiments. We then describe the training of the statistical models, namely the language model, the translation model, and the generation model, and discuss how decoding is extended. Finally, we analyze two morphological experiments: a morpheme model experiment and a factored method experiment. The results show that both experiments achieve higher BLEU scores than the baseline system, indicating that the problems of word-form selection and word order are partially resolved.
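As a rough illustration of how derived corpora of this kind might be produced from a morphologically analyzed sentence, the sketch below builds a morpheme-level token stream and a factored representation. It is a minimal sketch, not the authors' implementation: the `Token` fields, both function names, the `+` suffix marker, the factor order, and the romanized toy sentence are all assumptions made for illustration. Only the `|` factor separator follows the convention used by factored translation models in Moses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One analyzed word (hypothetical analyzer output, not the paper's format)."""
    surface: str          # full word form
    stem: str             # stem from morphological analysis
    suffixes: List[str]   # suffixes in order of attachment
    pos: str              # part-of-speech tag


def to_morpheme_stream(tokens: List[Token]) -> str:
    """Morpheme-model view: each stem and each suffix becomes its own token."""
    out = []
    for t in tokens:
        out.append(t.stem)
        # Prefix suffixes with '+' so they can be re-attached after decoding.
        out.extend("+" + s for s in t.suffixes)
    return " ".join(out)


def to_factored(tokens: List[Token]) -> str:
    """Factored view: surface|stem|POS|morphology for every word."""
    parts = []
    for t in tokens:
        morph = "+".join(t.suffixes) if t.suffixes else "-"
        parts.append("|".join([t.surface, t.stem, t.pos, morph]))
    return " ".join(parts)


# Toy romanized sentence, invented for illustration only.
sentence = [
    Token("nomuudaas", "nom", ["uud", "aas"], "N"),
    Token("unshina", "unsh", ["na"], "V"),
]

morpheme_line = to_morpheme_stream(sentence)
factored_line = to_factored(sentence)
print(morpheme_line)
print(factored_line)
```

Splitting suffixes into separate tokens shrinks the vocabulary the translation model has to cover. The factored form keeps the surface word intact and exposes the stem and morphology as extra factors, which a generation model can later use to choose the word form.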

Key words

computer application / Chinese information processing / morphology / statistical machine translation / corpus / statistical model / decoding

Cite this article

YANG Pan, ZHANG Jian, LI Miao, Wudabala, XUE Yan. Morphological Processing in Chinese-Mongolian Statistical Machine Translation. Journal of Chinese Information Processing. 2009, 23(1): 50

