Abstract
In Mongolian-Chinese neural machine translation, scarce parallel corpora cause severe data sparseness, which substantially degrades translation quality. This paper studies the application of sub-word segmentation to Mongolian-Chinese neural machine translation models. The Byte Pair Encoding (BPE) algorithm is used to set the segmentation granularity between the character level and the word level, splitting low-frequency words into relatively high-frequency sub-word units. This alleviates data sparseness and improves model robustness more efficiently under limited data and hardware resources. Experiments show that applying sub-word segmentation to two network models improves the BLEU score by 4.81 and 2.96, respectively, and that the shortening of the training cycle becomes more pronounced as the corpus grows, indicating that sub-word segmentation helps improve Mongolian-Chinese neural machine translation.
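The BPE procedure summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it follows the general merge-learning idea of BPE on a toy English word-frequency table, and the function names (`learn_bpe`, `segment`), the `</w>` end-of-word marker, and the corpus are all assumptions made for illustration.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    # Each word starts as its characters plus an end-of-word marker (assumed).
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


def segment(word, merges):
    """Apply learned merges, in order, to a (possibly unseen) word."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols


if __name__ == "__main__":
    # Hypothetical toy corpus; a real system would count words in training data.
    toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, _ = learn_bpe(toy, 10)
    # "lowest" never appears in the toy corpus, yet it is split into known units.
    print(segment("lowest", merges))
```

The number of merge operations controls the granularity: with zero merges every word stays at the character level, and with enough merges frequent words become single symbols, which is the character-to-word range the abstract refers to.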
Key words
Mongolian-Chinese neural machine translation /
data sparseness /
sub-word segmentation
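Funding
Natural Science Foundation of Inner Mongolia (2018MS06005); Inner Mongolia Special Support Project for Mongolian Language and Script Informatization (MW-2018-MGYWXXH-302); Inner Mongolia Autonomous Region Graduate Research Innovation Project (10000-16010109-18)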