Abstract: In this paper, multimodal neural machine translation refers to a machine learning method that uses neural networks to translate text together with accompanying image information in an end-to-end system. This paper proposes a multimodal machine translation model based on dual-attention decoding with a coverage mechanism. The model applies the coverage mechanism separately to the source-language text and the image, which reduces repeated attention to information that has already been covered. The effectiveness of the proposed method is verified on the official evaluation datasets of WMT16 and WMT17. Experimental results show that the method improves multimodal neural machine translation performance by 1.2%, 0.8%, 0.7% and 0.6% on the four benchmark datasets of WMT16 En-De/En-Fr and WMT17 En-De/En-Fr, respectively.
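To make the idea of dual attention with coverage concrete, the following is a minimal NumPy sketch, not the authors' implementation: it keeps one coverage vector per modality (source tokens and image regions), folds the accumulated coverage into an additive attention score, and updates both coverage vectors after each decoding step. All names (coverage_attention, W_q, W_k, W_c, v) and the toy dimensions are hypothetical illustrations.

```python
import numpy as np

def coverage_attention(query, keys, coverage, W_q, W_k, W_c, v):
    """Additive attention extended with a coverage term: keys that have
    already received much attention in previous steps contribute their
    accumulated coverage to the score, letting the model learn to avoid
    re-attending to the same information."""
    # scores[i] = v . tanh(W_q q + W_k k_i + cov_i * W_c)
    scores = np.tanh(query @ W_q + keys @ W_k + coverage[:, None] * W_c) @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over keys
    context = weights @ keys          # weighted sum of key vectors
    return context, weights

# Hypothetical decoding loop of a doubly-attentive decoder with
# a separate coverage vector for each modality.
d, n_src, n_img = 8, 5, 7
rng = np.random.default_rng(0)
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_c, v = rng.normal(size=d), rng.normal(size=d)

src_keys = rng.normal(size=(n_src, d))   # encoder states of source words
img_keys = rng.normal(size=(n_img, d))   # CNN features of image regions
cov_src, cov_img = np.zeros(n_src), np.zeros(n_img)
state = rng.normal(size=d)               # previous decoder hidden state

for _ in range(3):                       # a few decoding steps
    ctx_src, a_src = coverage_attention(state, src_keys, cov_src, W_q, W_k, W_c, v)
    ctx_img, a_img = coverage_attention(state, img_keys, cov_img, W_q, W_k, W_c, v)
    cov_src += a_src                     # accumulate per-modality attention history
    cov_img += a_img
    state = np.tanh(ctx_src + ctx_img)   # stand-in for the real RNN state update
```

The key design point illustrated here is that the two attention streams do not share a coverage vector: repetition over source words and repetition over image regions are penalized independently.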