[1] MITHUN N C, LI J, METZE F, et al. Learning joint embedding with multimodal cues for cross-modal video-text retrieval[C]//Proceedings of the ACM on International Conference on Multimedia Retrieval, 2018: 19-27.
[2] CHEN C Y, WANG J C, WANG J F. Efficient news video querying and browsing based on distributed news video servers[J]. IEEE Transactions on Multimedia, 2006, 8(2): 257-269.
[3] AWAD G, BUTT A, FISCUS J, et al. TRECVID 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking[C]//Proceedings of TRECVID, 2017.
[4] RASIWASIA N, COSTA PEREIRA J, COVIELLO E, et al. A new approach to cross-modal multimedia retrieval[C]//Proceedings of the 18th ACM International Conference on Multimedia, 2010: 251-260.
[5] SUN X, LONG X, HE D, et al. VSRNet: End-to-end video segment retrieval with text query[J]. Pattern Recognition, 2021, 119(4): 1-10.
[6] MIN S, KONG W, TU R C, et al. HunYuan_tvr for text-video retrieval[J]. arXiv preprint arXiv:2204.03382, 2022.
[7] CHOI S, KIM J T, CHOO J. Cars can't fly up in the sky: Improving urban-scene segmentation via height-driven attention networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 9370-9380.
[8] JIA C, YANG Y, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[C]//Proceedings of International Conference on Machine Learning. PMLR, 2021: 4904-4916.
[9] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of International Conference on Machine Learning. PMLR, 2021: 8748-8763.
[10] LUO H S, JI L, ZHONG M, et al. CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval[J]. Neurocomputing, 2022, 508: 293-304.
[11] GORTI S K, VOUITSIS N, MA J, et al. X-Pool: Cross-modal language-video attention for text-video retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 4996-5005.
[12] MANNING C D, SURDEANU M, BAUER J, et al. The Stanford CoreNLP natural language processing toolkit[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014: 55-60.
[13] MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, 1(14): 281-297.
[14] CALINSKI T, HARABASZ J. A dendrite method for cluster analysis[J]. Communications in Statistics, 1974, 3(1): 1-27.
[15] XU J, MEI T, YAO T, et al. MSR-VTT: A large video description dataset for bridging video and language[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 5288-5296.
[16] WU Z, YAO T, FU Y, et al. Deep learning for video classification and captioning[M]//Frontiers of Multimedia Research, 2017: 3-29.
[17] HENDRICKS L A, WANG O, SHECHTMAN E, et al. Localizing moments in video with natural language[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 5803-5812.
[18] XU C, YANG G, DONG J, et al. Dual encoding for video retrieval by text[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(8): 4065-4080.
[19] LIU S, FAN H, QIAN S, et al. HiT: Hierarchical transformer with momentum contrast for video-text retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 11895-11905.
[20] FENG Z, ZENG Z, GUO C, et al. Temporal multimodal graph transformer with global-local alignment for video-text retrieval[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(3): 1438-1453.
[21] LIU Y, ALBANIE S, NAGRANI A, et al. Use what you have: Video retrieval using representations from collaborative experts[J]. arXiv preprint arXiv:1907.13487, 2019.
[22] GABEUR V, SUN C, ALAHARI K, et al. Multi-modal transformer for video retrieval[C]//Proceedings of the 16th European Conference on Computer Vision (ECCV). Springer, 2020: 214-229.
[23] DZABRAEV M, KALASHNIKOV M, KOMKOV S, et al. MDMMT: Multidomain multimodal transformer for video retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 3354-3363.
[24] FANG H, XIONG P, XU L, et al. CLIP2Video: Mastering video-text retrieval via image CLIP[J]. arXiv preprint arXiv:2106.11097, 2021.
[25] BAIN M, NAGRANI A, VAROL G, et al. Frozen in time: A joint video and image encoder for end-to-end retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 1708-1718.
[26] LEI J, LI L, ZHOU L, et al. Less is more: ClipBERT for video-and-language learning via sparse sampling[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 7327-7337.
[27] GE Y, GE Y, LIU X, et al. Bridging video-text retrieval with multiple choice questions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 16146-16155.

XI Yimeng (b. 2000), master's student; research interests: text-based video retrieval and cross-modal retrieval.
E-mail: xym20000928@163.com

LIU Libo (b. 1974), corresponding author, Ph.D., professor, doctoral supervisor; research interests: intelligent information processing and computer vision.
E-mail: liulib@163.com

DENG Zhen (b. 1984), Ph.D.; research interests: image processing and machine vision.
E-mail: dengzhen1025@163.com