Journal of Chinese Information Processing ›› 2022, Vol. 36 ›› Issue (11): 156-168.
Natural Language Understanding and Generation

A Dense Video Captioning Method Based on Multi-modal Features

MA Miao1,2, CHEN Xiaoqiu1, TIAN Zhuoyu1

Abstract

Dense video captioning automatically generates a sequence of sentences describing video content, combining techniques from computer vision and natural language processing. Existing methods tend to emphasize visual and motion information while ignoring the audio track, and to model only local or simple event-level context while ignoring the temporal structure and semantic relationships between events. To address this, this paper proposes a dense video captioning method based on multi-modal features. First, the Timeception layer is used as the basic module in the action proposal generation stage to better accommodate the diverse time spans of action segments. Second, audio features are used to enhance both the proposal generation and caption generation stages. Finally, a temporal semantic relation module models the temporal structure and semantic information between events to further improve caption accuracy. In addition, this paper constructs SDVC, a dense video captioning dataset of real learning scenes, to examine the effectiveness of the proposed method in that setting. Experimental results on the ActivityNet Captions and SDVC datasets show that the AUC of action proposal generation increases by 0.8% and 6.7%, respectively. When captions are generated from ground-truth action proposals, BLEU_3 increases by 1.4% and 4.7% and BLEU_4 by 0.9% and 5.3%, respectively; when captions are generated from the generated action proposals, BLEU_3 and BLEU_4 on SDVC increase by 2.3% and 2.2%, respectively.
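
As a concrete illustration of the pipeline described above, the following is a minimal PyTorch sketch (not the authors' released code) of two of its ingredients: a Timeception-style block that applies parallel temporal convolutions at several kernel sizes, so that a single layer can cover action segments with very different durations, and a gated audio-visual fusion layer that mixes visual features (e.g., 3D-CNN clip features) with audio embeddings (e.g., VGGish-style) before proposals are scored. All dimensions, kernel sizes, and the gated-fusion design are illustrative assumptions; the temporal semantic relation module is omitted.

import torch
import torch.nn as nn

class MultiScaleTemporalBlock(nn.Module):
    """Timeception-style block: parallel depthwise temporal convolutions
    with different kernel sizes. Kernel sizes here are illustrative."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # A 1x1 convolution fuses the concatenated branches back to `channels`.
        self.fuse = nn.Conv1d(channels * len(kernel_sizes), channels, 1)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, time) -- a sequence of clip-level features.
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.fuse(y)) + x  # residual connection

class GatedAVFusion(nn.Module):
    """One plausible audio-visual fusion: a sigmoid gate decides, per
    feature dimension, how much of each modality to keep."""
    def __init__(self, visual_dim, audio_dim, hidden=512):
        super().__init__()
        self.v_proj = nn.Linear(visual_dim, hidden)
        self.a_proj = nn.Linear(audio_dim, hidden)
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, v, a):
        # v: (batch, time, visual_dim); a: (batch, time, audio_dim),
        # assumed pre-aligned to a common temporal resolution.
        v, a = self.v_proj(v), self.a_proj(a)
        g = torch.sigmoid(self.gate(torch.cat([v, a], dim=-1)))
        return g * v + (1.0 - g) * a

# Example: fuse modalities, then add multi-scale temporal context
# before scoring candidate proposal boundaries.
visual = torch.randn(2, 64, 1024)   # 2 videos, 64 steps, 1024-d visual
audio = torch.randn(2, 64, 128)     # matching 128-d audio embeddings
fused = GatedAVFusion(1024, 128)(visual, audio)                # (2, 64, 512)
context = MultiScaleTemporalBlock(512)(fused.transpose(1, 2))  # (2, 512, 64)

The residual connection and the per-dimension gate are common design choices that keep the fused stream stable when one modality is uninformative, for example during silent segments.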

Key words

dense video captioning / multi-modal features / temporal structure / semantic relationship

Cite this article

MA Miao, CHEN Xiaoqiu, TIAN Zhuoyu. A Dense Video Captioning Method Based on Multi-modal Features. Journal of Chinese Information Processing, 2022, 36(11): 156-168.

Funding

National Natural Science Foundation of China (61877038, U2001205); Graduate Innovation Team Project of Shaanxi Normal University (TD2020044Y)