Journal of Chinese Information Processing, 2024, Vol. 38, Issue (5): 155-166.
Natural Language Understanding and Generation


Visual Story Generation Based on Adaptive Pre-trained Model Matching

  • NING Ming1, JIANG Aiwen1, CUI Zhaoyang1, LIU Changhong1, WANG Mingwen1,2

Abstract

Visual story generation is the task of producing an expressive, coherent paragraph that accurately describes the visual content of a given image sequence; it is an interesting and rapidly developing multimodal research direction at the intersection of computer vision and natural language processing. Following the success of pre-trained models on a wide range of downstream tasks, visual story generation algorithms built on pre-trained models have been studied extensively. However, owing to the differences between data modalities and the semantic gap between them, pre-trained models suffer from catastrophic forgetting during fine-tuning. How to coordinate pre-trained models for the visual and linguistic modalities is therefore one of the main goals of current multimodal pre-training research. This paper proposes a visual story generation algorithm based on adaptive matching of pre-trained models. On the one hand, it comprehensively mines diverse, complementary visual, relational, and sequential information from the image stream to bridge the semantic gap; on the other hand, it applies an adaptive matching loss to align the features of the image and text modalities and to align the sequential information within the image stream, which yields good results. The algorithm is compared with recent state-of-the-art methods on the public visual storytelling dataset (VIST). The evaluation results show that it achieves competitive performance in terms of image-text relevance, text diversity, and logical coherence of the generated stories.

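The abstract does not spell out the concrete form of the adaptive matching loss, so the following minimal PyTorch sketch is only one way to illustrate the two alignment terms it describes: a cross-modal term that matches each image in a story to its corresponding sentence, and a sequential term that keeps adjacent images of the photo stream consistent. The symmetric contrastive form, the temperature, and the weight lambda_seq are assumptions for illustration, not the paper's actual formulation.

import torch
import torch.nn.functional as F

def adaptive_matching_loss(img_feats, txt_feats, temperature=0.07, lambda_seq=1.0):
    # img_feats, txt_feats: (N, D) features of the N images of one story and of
    # their N corresponding sentences, produced by pre-trained encoders.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)

    # Cross-modal alignment: pull each image toward its own sentence and push it
    # away from the other sentences of the story (symmetric contrastive loss).
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    loss_match = 0.5 * (F.cross_entropy(logits, targets)
                        + F.cross_entropy(logits.t(), targets))

    # Sequential alignment: a simple proxy for keeping the continuous information
    # of the image stream consistent between adjacent images.
    loss_seq = (1.0 - F.cosine_similarity(img[:-1], img[1:], dim=-1)).mean()

    return loss_match + lambda_seq * loss_seq

In a full model, img_feats and txt_feats would come from the vision and language encoders being adapted, and this loss would be added to the story-generation objective; the weighting between the two terms is a design choice left open by the abstract.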


Key words

visual storytelling / adaptive matching loss / pretrained model / multimodal feature / image sequence

Cite this article

NING Ming, JIANG Aiwen, CUI Zhaoyang, LIU Changhong, WANG Mingwen. Visual Story Generation Based on Adaptive Pre-trained Model Matching. Journal of Chinese Information Processing. 2024, 38(5): 155-166


Funding

National Natural Science Foundation of China (61966018, 62067004, 62266023)