Abstract
When a statistical machine translation (SMT) system is used to translate domain-specific text, it often runs into a cross-domain problem: if the text to be translated and the training corpus come from the same domain, translation quality is usually good, but if the domains differ substantially, quality degrades markedly. Bilingual parallel corpora for a specific domain are limited, whereas domain-mixed parallel corpora and domain-specific monolingual text are comparatively easy to obtain. Exploiting this fact, this paper proposes a translation probability model that incorporates domain information and jointly uses domain-mixed bilingual data and domain-specific source-language monolingual data for SMT domain adaptation. Experiments show that the adapted model improves over the baseline on all three test sets of the IWSLT machine translation evaluation, demonstrating the effectiveness of the proposed method.
Key words: statistical machine translation; domain adaptation; context information
Funding
Supported by the National Natural Science Foundation of China (60873167)