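As the training corpora of statistical machine translation systems keep growing, the number of long sentences increases, and making effective use of the information in long sentences to improve translation quality is one of the main problems facing such systems. Based on Xu's sentence segmentation model, this paper proposes a method for segmenting long sentences at the training stage. The method uses automatically acquired boundary word probabilities and the length ratio of the resulting sub-sentence pairs to guide segmentation, producing segmentations that better reflect the semantics. Experiments on the NIST test sets show that the method achieves an improvement of up to 0.5 BLEU points.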
Abstract
Segmenting long sentences effectively is an important problem in improving machine translation quality. This paper proposes a new method for segmenting long sentences during training. The method determines boundary words and their probabilities automatically, without manual intervention, which yields segmentations that are more semantically meaningful. In addition, the lengths of the segmented sub-sentences are kept balanced between the source and target languages. Experiments on the NIST test sets show an improvement of up to 0.5 BLEU points.
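As a rough illustration of how a boundary word probability and a length-ratio term might be combined to choose a split point, consider the minimal sketch below. It is not the paper's scoring function and does not reproduce Xu's model. The function names, the multiplicative score, the restriction to a single monotone split, and the `expected_ratio` parameter are all assumptions made for the example.

```python
def length_ratio_score(src_len, tgt_len, expected_ratio=1.0):
    """Toy balance term (assumed form): 1.0 when tgt_len / src_len equals
    expected_ratio, shrinking toward 0 as the pair becomes unbalanced."""
    if src_len == 0 or tgt_len == 0:
        return 0.0
    ratio = (tgt_len / src_len) / expected_ratio
    return min(ratio, 1.0 / ratio)


def best_split(src, tgt, boundary_prob, expected_ratio=1.0):
    """Return the split (i, j) with the highest toy score, or None.

    src[:i] is paired with tgt[:j] and src[i:] with tgt[j:].
    boundary_prob is a hypothetical table mapping a source word to the
    probability that a sub-sentence ends right after it.
    """
    best_score, best_pos = 0.0, None
    for i in range(1, len(src)):
        p = boundary_prob.get(src[i - 1], 0.0)
        if p == 0.0:
            continue  # never split after a word with no boundary evidence
        for j in range(1, len(tgt)):
            left = length_ratio_score(i, j, expected_ratio)
            right = length_ratio_score(len(src) - i, len(tgt) - j, expected_ratio)
            score = p * left * right
            if score > best_score:
                best_score, best_pos = score, (i, j)
    return best_pos
```

For example, with `src = "a b , c d e".split()`, `tgt = "x y , z w v".split()`, and `boundary_prob = {",": 0.9}`, the call `best_split(src, tgt, boundary_prob)` returns `(3, 3)`: the only candidate source boundary is after the comma, and the balanced 3/3 target split maximizes both length-ratio terms.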
Keywords
statistical machine translation /
sentence segmentation model /
boundary word probability
Key words
statistical machine translation /
sentence segmentation model /
boundary word probability
References
[1] K Yamada, K Knight. A syntax-based statistical translation model[C]//Proceedings of ACL, 2001: 523-530.
[2] Philipp Koehn, Franz Josef Och, Daniel Marcu. Statistical phrase-based translation[C]//Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL), Edmonton, Canada, May/June 2003: 127-133.
[3] Qun Liu. A survey of statistical machine translation[J]. Journal of Chinese Information Processing, 2003, 17(4): 1-12. (in Chinese)
[4] Yang Liu, Qun Liu, Shouxun Lin. Tree-to-string alignment template for statistical machine translation[C]//Proceedings of COLING/ACL 2006, Sydney, Australia, July 2006: 609-616.
[5] Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. Moses: Open source toolkit for statistical machine translation[C]//Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007: 177-180.
[6] David Chiang. Hierarchical phrase-based translation[J]. Computational Linguistics, 2007, 33(2): 201-228.
[7] Yanqing He, Jiajun Zhang, Maoxi Li, et al. The CASIA statistical machine translation system for IWSLT 2008[C]//Proceedings of the IWSLT, 2008: 85-91.
[8] Maoxi Li, Jiajun Zhang, Yu Zhou, et al. The CASIA statistical machine translation system for IWSLT 2009[C]//Proceedings of the IWSLT, 2009: 83-90.
[9] Tong Xiao, Jingbo Zhu, Hao Zhang, et al. NiuTrans: An open source toolkit for phrase-based and syntax-based machine translation[C]//Proceedings of ACL 2012 System Demonstrations, 2012: 19-24.
[10] Yeun-Bae Kim, Terumasa Ehara. A method for partitioning of long Japanese sentences with subject resolution in J/E machine translation[C]//Proceedings of the International Conference on Computer Processing of Oriental Languages, 1994: 467-473.
[11] Francisco Nevado, Francisco Casacuberta, Enrique Vidal. Parallel corpora segmentation using anchor words[C]//Proceedings of the 7th International EAMT Workshop on MT and Other Language Technology Tools, Improving MT through Other Language Technology Tools: Resources and Tools for Building MT, 2003: 33-40.
[12] J Xu, R Zens, H Ney. Sentence segmentation using IBM word alignment model 1[C]//Proceedings of the 10th Annual Conference of the European Association for Machine Translation, Budapest, Hungary, 2005: 280-287.
[13] B Meng, S Huang, X Dai, et al. Segmenting long sentence pairs for statistical machine translation[C]//Proceedings of the International Conference on Asian Language Processing, Singapore, 2009: 53-58.
[14] Takao Doi, Eiichiro Sumita. Input sentence splitting and translating[C]//Proceedings of the HLT/NAACL Workshop on Building and Using Parallel Texts, 2003: 104-110.
[15] Osamu Furuse, Setsuo Yamada, Kazuhide Yamamoto. Splitting long and ill-formed input for robust spoken-language translation[C]//Proceedings of COLING-ACL, 1998: 421-460.
[16] K Sudoh, K Duh, H Tsukada, et al. Divide and translate: Improving long distance reordering in statistical machine translation[C]//Proceedings of the Joint 5th Workshop on SMT and Metrics MATR, 2010: 418-427.
[17] D Xiong, M Zhang, H Li. Learning translation boundaries for phrase-based decoding[C]//Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, California, 2010: 136-144.
[18] Hao Zhang, Daniel Gildea, David Chiang. Extracting synchronous grammar rules from word-level alignments in linear time[C]//Proceedings of COLING, 2008: 1081-1088.