Abstract
This paper proposes to use a large, high-accuracy neural machine translation (NMT) model (the teacher model) to extract implicit bilingual knowledge from monolingual data, so as to improve the translation quality of a small, lower-accuracy NMT model (the student model). It first proposes a pseudo-bilingual-data teaching method, in which the student model is improved with synthetic bilingual data obtained by having the teacher model translate monolingual data. It then proposes a joint optimization of negative log-likelihood and knowledge distillation: besides the synthetic bilingual data, the target-language word probability distributions produced by the teacher model are used as knowledge to improve the student model within the knowledge distillation framework. Experiments on Chinese-English and German-English translation tasks show that student models trained with the proposed methods not only significantly outperform the baseline student model on in-domain test sets, but also generalize better to an out-of-domain test set.
Key words
neural machine translation /
knowledge distillation /
monolingual data
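To make the joint negative log-likelihood and knowledge distillation objective described in the abstract concrete, the following is a minimal PyTorch-style sketch of a word-level interpolation between the standard NLL loss and a distillation loss against the teacher's output distribution. All names (joint_nll_kd_loss, alpha, temperature, pad_id) and hyper-parameter values are hypothetical illustrations, not taken from the paper.

import torch.nn.functional as F

def joint_nll_kd_loss(student_logits, teacher_logits, target_ids,
                      alpha=0.5, temperature=1.0, pad_id=0):
    # student_logits, teacher_logits: (batch, seq_len, vocab_size)
    # target_ids: (batch, seq_len); alpha, temperature and pad_id are
    # illustrative hyper-parameters, not values reported in the paper.
    vocab = student_logits.size(-1)

    # Negative log-likelihood w.r.t. the (possibly synthetic) target tokens.
    nll = F.cross_entropy(student_logits.view(-1, vocab),
                          target_ids.view(-1),
                          ignore_index=pad_id)

    # Word-level knowledge distillation: KL divergence between the student's
    # and the teacher's per-position distributions over the target vocabulary.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction='batchmean')

    # Joint objective: a convex combination of the two loss terms.
    return alpha * nll + (1.0 - alpha) * kd

In the pseudo-bilingual-data setting described above, target_ids would be the teacher model's translations of monolingual source sentences rather than human references.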
Funding
National Natural Science Foundation of China (61876174, 61662077)