Combining Translation and Language Models for Bilingual Data Selection

YAO Liang, HONG Yu, LIU Hao, LIU Le, YAO Jianmin

Journal of Chinese Information Processing, 2016, Vol. 30, Issue 5: 145-152.
Survey


Abstract

Bilingual data selection aims to automatically extract, from a large-scale general-domain bilingual corpus, the sentence pairs most relevant to the domain of the text to be translated, so as to alleviate the shortage of training data for domain-specific translation models. In contrast to earlier selection methods that rely solely on language models, this paper approaches the problem from the perspective of generative modeling of sentence pairs and proposes a selection method that combines translation models with language models. The method evaluates both the domain relevance of a sentence pair and the degree to which its two sides are translations of each other. Experiments show that a translation system trained on sentence pairs selected with this method outperforms the baseline system, trained on the general-domain corpus, by 3.5 BLEU points on the test set. In addition, to address the problem of tuning the weights of the different features used to evaluate sentence-pair quality, the paper proposes an automatic feature weight optimization method based on sentence-pair re-ranking. A machine translation system built with this method gains a further 0.68 BLEU points.
Keywords: bilingual data selection; generative modeling; translation model; language model; weight tuning

Received: 2015-07-31
Finalized: 2016-01-25
Funding: National Natural Science Foundation of China (61373097, 61272259, 61272260)
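
The abstract describes scoring each general-domain sentence pair by both domain relevance (language models) and mutual translatability (a translation model), with tunable feature weights. The page does not give the paper's actual formulas, so the following is only a minimal sketch under stated assumptions: domain relevance is approximated by a cross-entropy difference between in-domain and general-domain language models, translatability by a length-normalised IBM Model 1 style score, and the two are combined linearly. The class `UnigramLM`, the functions `lm_score`, `tm_score`, and `rank_pairs`, the weights `w_lm` and `w_tm`, and the `lex_table` format are all hypothetical names introduced for illustration.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram LM (an assumed stand-in for a real n-gram LM)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(w for s in sentences for w in s)
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def cross_entropy(self, sentence):
        """Per-word negative log-probability of the sentence (natural log)."""
        logp = sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in sentence
        )
        return -logp / max(len(sentence), 1)


def lm_score(sentence, in_lm, gen_lm):
    """Cross-entropy difference, oriented so that higher means more in-domain."""
    return gen_lm.cross_entropy(sentence) - in_lm.cross_entropy(sentence)


def tm_score(src, tgt, lex_table, floor=1e-6):
    """Length-normalised IBM Model 1 style log P(tgt | src).

    lex_table maps (tgt_word, src_word) to a probability (assumed format);
    missing entries fall back to `floor`.
    """
    src_with_null = ["<NULL>"] + list(src)
    logp = 0.0
    for t in tgt:
        p = sum(lex_table.get((t, s), floor) for s in src_with_null)
        logp += math.log(p / len(src_with_null))
    return logp / max(len(tgt), 1)


def rank_pairs(pairs, in_src, gen_src, in_tgt, gen_tgt, lex_table,
               w_lm=0.5, w_tm=0.5):
    """Sort (src, tgt) token-list pairs from best to worst combined score."""
    def score(pair):
        src, tgt = pair
        relevance = lm_score(src, in_src, gen_src) + lm_score(tgt, in_tgt, gen_tgt)
        return w_lm * relevance + w_tm * tm_score(src, tgt, lex_table)
    return sorted(pairs, key=score, reverse=True)
```

In this sketch the top-ranked pairs would be kept as pseudo in-domain training data. The weights `w_lm` and `w_tm` are fixed by hand here; the re-ranking-based weight optimization mentioned in the abstract is not reproduced.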

Cite this article

YAO Liang, HONG Yu, LIU Hao, LIU Le, YAO Jianmin. Combining Translation and Language Models for Bilingual Data Selection. Journal of Chinese Information Processing. 2016, 30(5): 145-152.

