基于细粒度视觉特征和知识图谱的视觉故事生成算法


中文信息学报 ›› 2022, Vol. 36 ›› Issue (9) : 139-148.
自然语言理解与生成


李朦朦1,江爱文1,龙羽中1,宁铭1,彭虎2,王明文1

A Visual Storytelling Algorithm Based on Fine-grained Image Features and Knowledge Graph

LI Mengmeng1, JIANG Aiwen1, LONG Yuzhong1, NING Ming1, PENG Hu2, WANG Mingwen1

摘要

视觉故事生成是图像内容描述衍生的跨模态学习任务,在图文游记自动生成、启蒙教育等领域有较好的应用研究意义。目前主流方法存在对图像细粒度特征描述薄弱、故事文本的图文相关性低、语言不丰富等问题。为此,该文提出了基于细粒度视觉特征和知识图谱的视觉故事生成算法。该算法针对如何对图像内容进行充分挖掘和扩展表示,在视觉和高层语义方面,分别设计实现了图像细粒度视觉特征生成器和图像语义概念词集合生成器两个重要模块。在这两个模块中,细粒度视觉信息通过含有实体关系的场景图结构进行图卷积学习,高层语义信息综合外部知识图谱与相邻图像的语义关联进行扩充丰富,最终实现对图像序列内容较为全面细致的表示。该文算法在目前视觉故事生成领域规模最大的VIST数据集上与主流先进的算法进行了测试。实验结果表明,该文所提算法生成的故事文本,在图文相关性、故事逻辑性、文字多样性等方面,在Distinct-N和TTR等客观指标上均取得较大领先优势,具有良好的应用前景。
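The fine-grained visual feature generator described above applies graph convolution over the scene graph of each image so that entity relations are embedded into region features. As a minimal illustration only (not the paper's implementation; the toy adjacency, feature dimensions, and single-layer setup are assumptions), one Kipf-Welling-style GCN layer can be sketched in NumPy:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Hypothetical scene graph with 3 entity nodes (e.g. person, dog, frisbee)
# linked by relation edges, made symmetric for undirected propagation.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.random.randn(3, 8)   # initial node (region) features
W = np.random.randn(8, 8)   # learnable weight matrix
H1 = gcn_layer(A, H, W)     # relation-aware node features, shape (3, 8)
```

After one such layer, each node's feature already mixes in information from its scene-graph neighbors, which is what lets the generated sentences mention relations between entities rather than isolated objects.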

Abstract

Visual storytelling is a cross-modal task derived from image captioning, with substantial research significance and wide application in fields such as the automatic generation of illustrated travel notes and early education. Current methods suffer from insufficient description of fine-grained image content, low relevance between the images and the generated story, and a lack of linguistic richness. This paper proposes a visual storytelling algorithm based on fine-grained visual features and a knowledge graph. To fully mine and enrich the representation of image content, we design two key modules: a fine-grained visual feature generator and a semantic concept generator. The first applies graph convolution over a scene graph to embed entity relationships into the fine-grained visual features. The second integrates an external knowledge graph with the semantic associations of adjacent images to enrich high-level semantic information. Together they yield a comprehensive and detailed representation of the image sequence. Compared with several state-of-the-art methods on the VIST dataset, currently the largest benchmark for visual storytelling, the proposed algorithm achieves clear advantages on objective metrics such as Distinct-N and TTR, reflecting improved image-story relevance, story coherence, and word diversity.
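Distinct-N and TTR, the objective metrics cited above, are standard measures of lexical diversity. As a simple illustration of how they are typically computed (not the paper's exact evaluation code; the sample story is made up):

```python
def distinct_n(tokens, n):
    """Distinct-N: ratio of unique n-grams to total n-grams in the text."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def ttr(tokens):
    """Type-token ratio: number of unique words over total words."""
    return len(set(tokens)) / max(len(tokens), 1)

story = "the dog ran and the dog jumped".split()
print(distinct_n(story, 1))  # 5 unique unigrams / 7 tokens ≈ 0.714
print(distinct_n(story, 2))  # 5 unique bigrams / 6 bigrams ≈ 0.833
print(ttr(story))            # 5 word types / 7 tokens ≈ 0.714
```

Higher values on both metrics indicate less repetitive, more varied wording in the generated stories.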

关键词

视觉故事生成 / 场景图 / 知识图谱 / 文本生成 / 细粒度视觉特征

Key words

visual storytelling / scene graph / knowledge graph / text generation / fine-grained vision features

引用本文

李朦朦,江爱文,龙羽中,宁铭,彭虎,王明文. 基于细粒度视觉特征和知识图谱的视觉故事生成算法. 中文信息学报. 2022, 36(9): 139-148
LI Mengmeng, JIANG Aiwen, LONG Yuzhong, NING Ming, PENG Hu, WANG Mingwen. A Visual Storytelling Algorithm Based on Fine-grained Image Features and Knowledge Graph. Journal of Chinese Information Processing. 2022, 36(9): 139-148


基金

国家自然科学基金(61966018,61876074)