Abstract
Multimodal neural machine translation aims to improve the quality of text translation by exploiting visual information. Traditional multimodal machine translation models incorporate only the global semantic information of an image into the translation model, ignoring the impact of fine-grained image information on translation quality. To address this issue, this paper proposes a multimodal neural machine translation method guided by fine-grained image-text alignment semantics. The method first performs cross-modal interaction between image and text to extract fine-grained image-text alignment semantics; then, using these alignment semantics as a pivot, a gating mechanism aligns the fine-grained multimodal information with the textual information, achieving image-text multimodal feature fusion. Experimental results on the English-to-German, English-to-French, and English-to-Czech translation tasks of the Multi30K multimodal machine translation benchmark show that the proposed method is effective and outperforms most state-of-the-art multimodal machine translation methods.
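The two steps the abstract describes can be sketched numerically. The following is a minimal illustration, not the paper's actual model: text states attend over image region features to produce per-token aligned visual semantics (step 1), and a sigmoid gate then decides how much of that visual signal to fuse into each text state (step 2). All weights here are random stand-ins for learned parameters, and the dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8          # hidden size (illustrative)
T, R = 5, 4    # number of text tokens, image regions

H_text = rng.standard_normal((T, d))   # text encoder states
V_img = rng.standard_normal((R, d))    # image region features

# Step 1: cross-modal interaction -- each text token attends over the
# image regions, yielding fine-grained visual semantics aligned per token.
attn = softmax(H_text @ V_img.T / np.sqrt(d))   # (T, R) attention weights
V_aligned = attn @ V_img                        # (T, d) token-aligned visuals

# Step 2: gating mechanism -- a sigmoid gate (random weights here, learned
# in practice) controls how much aligned visual semantics each token admits.
W_g = rng.standard_normal((2 * d, d))
gate = 1.0 / (1.0 + np.exp(-(np.concatenate([H_text, V_aligned], axis=1) @ W_g)))
H_fused = H_text + gate * V_aligned             # (T, d), fed to the decoder

print(H_fused.shape)
```

The gate keeps fusion conservative: where the visual evidence is unhelpful for a token, the gate can close toward zero and the fused state falls back to the original text representation.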
Key words
multi-modal neural machine translation /
image-text fine-grained /
semantic interaction /
alignment semantics
Funding
National Key Research and Development Program of China (2020AAA0107904); National Natural Science Foundation of China (62366025); Natural Science Foundation of the Yunnan Provincial Department of Science and Technology (202301AT070444)