Context Gate Based Multimodal Information Fusion for Image Description Translation

LI Zhifeng, XU Minhan, HONG Yu, YAO Jianmin, ZHOU Guodong

Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (8): 55-68.
Machine Translation


Abstract

Image description translation is the task of taking an image together with a description of that image in one language, fusing the visual and textual modalities with a neural network in an end-to-end manner, and generating a description in a target language by means of translation techniques. When translating the source language into the target language, traditional image description translation uses salient features of the image to improve the translation process. During translation, the generation of each target word depends on both the source-language context and the target-language context. We observe that the source-language context mainly influences the adequacy and faithfulness of the translation, whereas the target-language context mainly influences its fluency and cohesion. Lacking an effective mechanism to regulate the contributions of these two kinds of context, a translation model may produce sentences that are fluent but inadequate, or adequate but disfluent. To address this problem, this paper proposes a gating-based multimodal information fusion decoding method that can be used to improve existing image description translation models. The model uses a source context gate to adjust the importance of the image features and of each source-language word, filtering out irrelevant image features, and a target context gate to dynamically adjust the contributions of the source-language and target-language contexts to the translation, thereby effectively improving both the adequacy and the fluency of the results. Experiments on the Multi30k dataset validate the effectiveness of the method: on the Multi30k-16 English-German and English-French test sets and the Multi30k-17 English-German and English-French test sets, BLEU-4 improves over the baseline system by 1.3, 1.0, 1.5, and 1.4 percentage points, respectively.

Abstract

Image description translation translates an image description into a target language, fusing the image and text modalities in an end-to-end system. Traditional image description translation assists the translation of the source language with salient features of the image. To capture the source-language context, which affects the adequacy of the translation, together with the target-language context, which affects its fluency, this paper proposes a multimodal information fusion decoding method based on a gating mechanism for image description translation. Our model uses context gates to dynamically adjust the contributions of the source- and target-language contexts to the translation results, improving both the adequacy and the fluency of the output. Experiments show that the method improves image description translation by 1.3, 1.0, 1.5, and 1.4 BLEU-4 points, respectively, on the four En-De and En-Fr tasks of Multi30k-16 and Multi30k-17.
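
To make the decoding idea concrete, below is a minimal PyTorch sketch of the two gates described in the abstract. It is an illustration under assumptions, not the authors' implementation: the module name, the use of single pooled vectors for the image, source, and target contexts, and the 512-dimensional toy setup are all hypothetical.

```python
import torch
import torch.nn as nn


class ContextGateFusion(nn.Module):
    """Minimal sketch of the two context gates described in the abstract.

    All names and dimensions are hypothetical; this only illustrates
    the gating idea, not the paper's exact architecture.
    """

    def __init__(self, d_model: int):
        super().__init__()
        # Source context gate: decides, per dimension, how much the image
        # features should contribute relative to the source-text context.
        self.src_gate = nn.Linear(2 * d_model, d_model)
        # Target context gate: trades off the fused source context, which
        # drives adequacy, against the target context, which drives fluency.
        self.tgt_gate = nn.Linear(2 * d_model, d_model)

    def forward(self, img_feat, src_ctx, tgt_ctx):
        # img_feat: (batch, d_model) pooled visual features
        # src_ctx:  (batch, d_model) attention summary of the source words
        # tgt_ctx:  (batch, d_model) decoder state over the generated prefix
        g_s = torch.sigmoid(self.src_gate(torch.cat([img_feat, src_ctx], dim=-1)))
        # Image dimensions irrelevant to the source text get gate values
        # near 0 and are filtered out of the fused source context.
        fused_src = g_s * img_feat + (1.0 - g_s) * src_ctx
        g_t = torch.sigmoid(self.tgt_gate(torch.cat([fused_src, tgt_ctx], dim=-1)))
        # The fused vector would feed the decoder's output layer.
        return g_t * fused_src + (1.0 - g_t) * tgt_ctx


# Toy usage: one decoding step for a batch of 4 with d_model = 512.
fusion = ContextGateFusion(d_model=512)
out = fusion(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 512])
```

Because both gates are element-wise sigmoids, the model can down-weight individual feature dimensions (e.g., visual features unrelated to the source sentence) rather than making an all-or-nothing choice between the image, the source context, and the target context.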


Key words

image description translation / multimodal machine translation / context gates / adequacy and fluency

Cite this article

LI Zhifeng, XU Minhan, HONG Yu, YAO Jianmin, ZHOU Guodong. Context Gate Based Multimodal Information Fusion for Image Description Translation. Journal of Chinese Information Processing. 2024, 38(8): 55-68


Funding

National Natural Science Foundation of China (62076174, 61773276, 61836007)