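Mongolian is a less widely used language, and research on Mongolian-to-Chinese machine translation has progressed slowly. Achieving high-quality Mongolian-Chinese machine translation is therefore of great importance to the development of information technology in China's ethnic minority regions. Word alignment plays a crucial role in machine translation quality. This paper proposes a word alignment method for Mongolian-Chinese machine translation that takes the stems and affixes obtained by segmenting Mongolian as its basic units. The method segments Mongolian sentences into stems and affixes using a stem-and-affix table and a reverse maximum matching algorithm. Experimental results show that segmenting Mongolian into stems and affixes significantly improves the alignment quality of the log-linear word alignment model.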
Abstract
High-quality Mongolian-to-Chinese machine translation is of great significance to the development of information technology in ethnic minority areas. To address word alignment, a key issue in statistical machine translation (SMT), this paper proposes a Mongolian-Chinese word alignment method that takes the stems and affixes obtained by segmenting Mongolian as its basic units. Mongolian sentences are segmented using a stem-and-affix table and a reverse maximum matching algorithm. Experimental results indicate that the proposed method significantly improves alignment quality.
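The segmentation step described in the abstract can be sketched as follows. This is a minimal illustration of generic reverse maximum matching over a stem-and-affix table, not the authors' implementation. The function names, the `max_len` parameter, the single-character fallback for unmatched text, and the table entries are all assumptions made for the example; the entries are invented placeholder strings, not real Mongolian stems or affixes.

```python
def reverse_max_match(word, table, max_len=None):
    """Segment `word` from right to left, preferring the longest table entry.

    `table` is a set of known units (stems and affixes). Characters that
    match no entry are emitted one at a time (an assumed fallback).
    """
    if max_len is None:
        max_len = max((len(unit) for unit in table), default=1)
    units = []
    end = len(word)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            piece = word[end - size:end]
            if size == 1 or piece in table:
                units.append(piece)
                end -= size
                break
    units.reverse()  # pieces were collected right to left
    return units


def segment_sentence(sentence, table):
    """Segment every whitespace-separated word of a sentence."""
    return [unit
            for word in sentence.split()
            for unit in reverse_max_match(word, table)]


# Hypothetical toy table: invented strings used only to exercise the code.
TOY_TABLE = {"kata", "mira", "mi", "ran", "lo", "sun", "n"}

print(reverse_max_match("katalosun", TOY_TABLE))
print(reverse_max_match("miran", TOY_TABLE))
print(segment_sentence("katalosun miran", TOY_TABLE))
```

Because matching starts at the end of the word, `miran` is split as `mi` + `ran` here, whereas a left-to-right longest match over the same table would yield `mira` + `n`. The resulting units, rather than whole words, would then be passed to the word aligner.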
Key words
word alignment /
IBM model /
stem and affix segmentation /
log-linear model
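Funding
National Natural Science Foundation of China (61363052, 61502255); Natural Science Foundation of Inner Mongolia Autonomous Region (2016MS0605); Fund of the Ethnic Affairs Commission of Inner Mongolia Autonomous Region (MW-2017-MGYWXXH-03)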