From Vision to Text: A Brief Survey for Image Captioning

WEI Zhongyu1, FAN Zhihao1, WANG Ruize2, CHENG Yijing1, ZHAO Wangrong1, HUANG Xuanjing3

Journal of Chinese Information Processing, 2020, Vol. 34, Issue (7): 19-29.
Survey

Abstract

In recent years, increasing attention has been drawn to cross-modality research, especially to topics connecting vision and language. This survey focuses on image captioning, a core task in vision-and-language research, and reviews the literature from four aspects: the overall framework for vision-grounded text generation, key issues in vision-grounded text generation, the evaluation of image captioning models, and the main development of image captioning approaches. Finally, we suggest several directions for future research, including cross-modal feature alignment between vision and language, the design of automatic evaluation metrics, and diverse image caption generation.

Keywords

image captioning / cross-modality alignment / literature review

Cite this article

WEI Zhongyu, FAN Zhihao, WANG Ruize, CHENG Yijing, ZHAO Wangrong, HUANG Xuanjing. From Vision to Text: A Brief Survey for Image Captioning. Journal of Chinese Information Processing, 2020, 34(7): 19-29.

Funding

National Natural Science Foundation of China (71991471); National Social Science Foundation of China (20ZDA060); Science and Technology Commission of Shanghai Municipality (18DZ1201000, 17JC1420200)