[1] MITHUN N C, LI J, METZE F, et al. Learning joint embedding with multimodal cues for cross-modal video-text retrieval[C]//Proceedings of the ACM on International Conference on Multimedia Retrieval, 2018: 19-27.
[2] CHEN C Y, WANG J C, WANG J F. Efficient news video querying and browsing based on distributed news video servers[J]. IEEE Transactions on Multimedia, 2006, 8(2): 257-269.
[3] AWAD G, BUTT A, FISCUS J, et al. TRECVID 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking[C]//Proceedings of TRECVID, 2017.
[4] RASIWASIA N, COSTA PEREIRA J, COVIELLO E, et al. A new approach to cross-modal multimedia retrieval[C]//Proceedings of the 18th ACM International Conference on Multimedia, 2010: 251-260.
[5] SUN X, LONG X, HE D, et al. VSRNet: End-to-end video segment retrieval with text query[J]. Pattern Recognition, 2021, 119(4): 1-10.
[6] MIN S, KONG W, TU R C, et al. HunYuan_tvr for text-video retrieval[J]. arXiv preprint arXiv:2204.03382, 2022.
[7] CHOI S, KIM J T, CHOO J. Cars can't fly up in the sky: Improving urban-scene segmentation via height-driven attention networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 9370-9380.
[8] JIA C, YANG Y, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[C]//Proceedings of International Conference on Machine Learning. PMLR, 2021: 4904-4916.
[9] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of International Conference on Machine Learning. PMLR, 2021: 8748-8763.
[10] LUO H S, JI L, ZHONG M, et al. CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval[J]. Neurocomputing, 2022, 508: 293-304.
[11] GORTI S K, VOUITSIS N, MA J, et al. X-Pool: Cross-modal language-video attention for text-video retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 4996-5005.
[12] MANNING C D, SURDEANU M, BAUER J, et al. The Stanford CoreNLP natural language processing toolkit[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014: 55-60.
[13] MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, 1(14): 281-297.
[14] CALINSKI T, HARABASZ J. A dendrite method for cluster analysis[J]. Communications in Statistics, 1974, 3(1): 1-27.
[15] XU J, MEI T, YAO T, et al. MSR-VTT: A large video description dataset for bridging video and language[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 5288-5296.
[16] WU Z, YAO T, FU Y, et al. Deep learning for video classification and captioning[M]//Frontiers of Multimedia Research, 2017: 3-29.
[17] HENDRICKS L A, WANG O, SHECHTMAN E, et al. Localizing moments in video with natural language[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 5803-5812.
[18] XU C, YANG G, DONG J, et al. Dual encoding for video retrieval by text[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(8): 4065-4080.
[19] LIU S, FAN H, QIAN S, et al. HiT: Hierarchical transformer with momentum contrast for video-text retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 11895-11905.
[20] FENG Z, ZENG Z, GUO C, et al. Temporal multimodal graph transformer with global-local alignment for video-text retrieval[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(3): 1438-1453.
[21] LIU Y, ALBANIE S, NAGRANI A, et al. Use what you have: Video retrieval using representations from collaborative experts[J]. arXiv preprint arXiv:1907.13487, 2019.
[22] GABEUR V, SUN C, ALAHARI K, et al. Multi-modal transformer for video retrieval[C]//Proceedings of the 16th European Conference on Computer Vision (ECCV). Springer, 2020: 214-229.
[23] DZABRAEV M, KALASHNIKOV M, KOMKOV S, et al. MDMMT: Multidomain multimodal transformer for video retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 3354-3363.
[24] FANG H, XIONG P, XU L, et al. CLIP2Video: Mastering video-text retrieval via image CLIP[J]. arXiv preprint arXiv:2106.11097, 2021.
[25] BAIN M, NAGRANI A, VAROL G, et al. Frozen in time: A joint video and image encoder for end-to-end retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 1708-1718.
[26] LEI J, LI L, ZHOU L, et al. Less is more: ClipBERT for video-and-language learning via sparse sampling[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 7327-7337.
[27] GE Y, GE Y, LIU X, et al. Bridging video-text retrieval with multiple choice questions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 16146-16155.

XI Yimeng (b. 2000), master's student; research interests: text-based video retrieval and cross-modal retrieval.
E-mail: xym20000928@163.com

LIU Libo (b. 1974), corresponding author, Ph.D., professor, doctoral supervisor; research interests: intelligent information processing and computer vision.
E-mail: liulib@163.com

DENG Zhen (b. 1984), Ph.D.; research interests: image processing and machine vision.
E-mail: dengzhen1025@163.com