Domain Adaptation of Reordering Model via Topic Information: Word Order in Translated Text across Domains

LIU Mengyi, YAO Liang, HONG Yu, LIU Hao, YAO Jianmin

Journal of Chinese Information Processing ›› 2017, Vol. 31 ›› Issue (5): 50-58.
Machine Translation

Abstract

Research on domain adaptation (DA) for statistical machine translation (SMT) aims to adjust the translation model dynamically so that it can learn and handle the linguistic characteristics of the target domain, thereby ensuring balanced and reliable translation quality across domains. Existing work on translation model adaptation has made remarkable progress, but the domain adaptability of the reordering process has received comparatively little attention. In our earlier work, statistics over a large-scale set of real source-target translation samples showed that, among semantically equivalent phrase-level translation pairs, 36.17% of the samples exhibit significant word order differences across domains. To address this issue, this paper takes a topic perspective, explores the differences in phrase reordering under different topic distributions, and proposes a domain-adaptive reordering model that incorporates topic information. Experimental results show that a translation system with the adaptive reordering model embedded achieves a clear performance advantage.

Key words

statistical machine translation / domain adaptation / reordering model / topic model

Cite this article

LIU Mengyi, YAO Liang, HONG Yu, LIU Hao, YAO Jianmin. Domain Adaptation of Reordering Model via Topic Information: Word Order in Translated Text across Domains. Journal of Chinese Information Processing. 2017, 31(5): 50-58

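Funding

National Natural Science Foundation of China (61373097, 61672368, 61672367, 61331011); Jiangsu Provincial Science and Technology Program (SBK2015022101); Ministry of Education-China Mobile Research Fund (MCM20150602)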