Abstract
Research on Chinese-Tibetan machine translation has so far concentrated on rule-based methods, mainly because parallel corpora and other basic resources for the language pair are scarce, which makes large-scale statistical machine translation experiments impractical. Motivated by the practical needs of a Chinese-Tibetan computer-aided translation project, this paper proposes a machine translation method based on phrase-string examples that supplies candidate translations for computer-aided translation even when little parallel data is available. The method relies on word alignment information to make full use of the existing parallel corpus. Experimental results show that it outperforms conventional sentence-level example-based translation and can retrieve translation examples for phrase strings of arbitrary length. On the test set, its BLEU score is close to that of Moses with default parameters, and the recall of phrase translation examples improves by about 9.71% over Moses. On test data with an average sentence length of 20 words, translation takes 0.175 s per sentence on average, which meets the real-time requirement of computer-aided translation.
Key words: machine translation; computer-aided translation; phrase-based translation; example-based translation
Funding
High-Tech Project of the Western Action Plan, Chinese Academy of Sciences (KGCX2-YW-512); National Science and Technology Major Project (2010ZX01036-001-002, 2010ZX01037-001-002)