Video-guided Machine Translation by Spatial-Temporal Attention

JIANG Zhou 1,2, YU Zhengtao 1,2, GAO Shengxiang 1,2, MAO Cunli 1,2, GUO Junjun 1,2

Journal of Chinese Information Processing, 2024, Vol. 38, Issue (4): 50-58.

Machine Translation

Abstract

Video-guided machine translation is a multimodal machine translation task whose goal is to produce high-quality text translation by jointly exploiting video and text. However, previous work selects relevant clips based only on the temporal structure of the video, and the selected clips contain a large amount of information that is irrelevant to the target language. As a result, the spatial-temporal structure of the video is not fully exploited during translation, and the problems of missing details and mistranslation remain unaddressed. To address this issue, this paper proposes a spatial-temporal attention (STA) model that makes full use of the spatial-temporal information in the video to guide machine translation. The proposed attention model not only selects the spatial-temporal segments most relevant to the target language, but also, given the sentence context, further focuses on the most relevant entity information within those segments. The attended entity information effectively strengthens the semantic alignment between the source and target languages, so that details in the source language are translated accurately. Experiments on the public Vatex dataset and a self-built Chinese-Vietnamese low-resource dataset show that the proposed method achieves BLEU-4 scores of 32.66 and 18.46, respectively, outperforming the temporal-attention baseline by 3.54 and 0.89 BLEU.
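To make the mechanism described above concrete, the following is a minimal sketch (in PyTorch) of how a spatial-temporal attention block could combine per-frame video features and per-frame region features with a decoder state. This is not the authors' implementation: the module name SpatialTemporalAttention, the inputs frame_feats and region_feats, the additive attention scoring, and the concatenation-based fusion are illustrative assumptions only.

```python
# Sketch of a spatial-temporal attention block for video-guided MT.
# Assumptions (not from the paper): PyTorch, additive (Bahdanau-style) scoring,
# per-frame global features for temporal attention, and per-frame region/entity
# features (e.g., object-detector proposals) for spatial attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialTemporalAttention(nn.Module):
    def __init__(self, video_dim: int, region_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        # Spatial attention: score each region inside a frame against the decoder state.
        self.spatial_score = nn.Sequential(
            nn.Linear(region_dim + hidden_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )
        # Temporal attention: score each entity-aware frame against the decoder state.
        self.temporal_score = nn.Sequential(
            nn.Linear(video_dim + region_dim + hidden_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )

    def forward(self, frame_feats, region_feats, dec_state):
        # frame_feats:  (B, T, video_dim)     per-frame global features
        # region_feats: (B, T, R, region_dim) per-frame region/entity features
        # dec_state:    (B, hidden_dim)       current decoder hidden state
        B, T, R, _ = region_feats.shape
        h = dec_state[:, None, None, :].expand(B, T, R, -1)
        # 1) Spatial attention: weight the entities within each frame.
        s_scores = self.spatial_score(torch.cat([region_feats, h], dim=-1)).squeeze(-1)   # (B, T, R)
        s_alpha = F.softmax(s_scores, dim=-1)
        entity_ctx = (s_alpha.unsqueeze(-1) * region_feats).sum(dim=2)                    # (B, T, region_dim)
        # 2) Temporal attention: weight the entity-aware frames/segments.
        h_t = dec_state[:, None, :].expand(B, T, -1)
        t_scores = self.temporal_score(
            torch.cat([frame_feats, entity_ctx, h_t], dim=-1)
        ).squeeze(-1)                                                                     # (B, T)
        t_alpha = F.softmax(t_scores, dim=-1)
        video_ctx = (t_alpha.unsqueeze(-1) * torch.cat([frame_feats, entity_ctx], dim=-1)).sum(dim=1)
        return video_ctx, t_alpha, s_alpha  # context vector to fuse with the text decoder


if __name__ == "__main__":
    sta = SpatialTemporalAttention(video_dim=1024, region_dim=2048, hidden_dim=512)
    ctx, t_alpha, s_alpha = sta(
        torch.randn(2, 8, 1024),      # 8 sampled frames per video
        torch.randn(2, 8, 10, 2048),  # 10 region proposals per frame
        torch.randn(2, 512),          # decoder state at the current step
    )
    print(ctx.shape)  # torch.Size([2, 3072])
```

In this sketch, spatial attention first narrows each frame down to its most relevant entities, and temporal attention then weights the resulting entity-aware frames, so the returned context vector can be fused with the text decoder at each decoding step; the feature extractors and fusion actually used in the paper may differ.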


Key words

spatial-temporal attention / video-guided machine translation / detail missing / temporal attention / spatial attention

Cite this article

JIANG Zhou, YU Zhengtao, GAO Shengxiang, MAO Cunli, GUO Junjun. Video-guided Machine Translation by Spatial-Temporal Attention. Journal of Chinese Information Processing, 2024, 38(4): 50-58.


Funding

National Key Research and Development Program of China (2019QY1801, 2019QY1802, 2019QY1800); National Natural Science Foundation of China (U23A20388, 62376111, U21B2027, 61732005, 61761026, 61972186, 61762056); Yunnan High-Tech Industry Development Project (201606); Major Science and Technology Special Program of Yunnan Province (202401BC070021, 202103AA080015, 202303AP140008, 202002AD080001-5); Yunnan Fundamental Research Program (202001AS070014, 2018FB104); Reserve Talents of Academic and Technical Leaders of Yunnan Province (202105AC160018)