Chinese Image Captioning Based on Middle-Level Visual-Semantic Composite Attributes
(基于视觉-语义中间综合属性特征的图像中文描述生成算法)

XIAO Yuhan, JIANG Aiwen, WANG Mingwen, JIE Anquan

Journal of Chinese Information Processing ›› 2021, Vol. 35 ›› Issue (4): 129-138.
Multimodal Natural Language Processing

Abstract

Image captioning is a multi-modal information processing task at the intersection of computer vision, natural language processing, and machine learning: an algorithm must effectively handle information from two different modalities, images and language, and the heterogeneous semantic gap between them makes the task highly challenging. Mainstream research still concentrates on English image captioning, while Chinese image captioning remains relatively under-studied. In addition, visual information has not received sufficient attention in existing captioning algorithms, whose performance depends largely on the language model. To address these two gaps, this paper proposes a Chinese image captioning algorithm based on multi-level selective visual-semantic attribute features. Combining object detection with an attention mechanism, the algorithm fully exploits the Chinese attribute information corresponding to high-level visual semantics in the image and extracts attribute context representations at different scales and levels. To verify its effectiveness, the algorithm is evaluated on the AI Challenger 2017 dataset, currently the largest Chinese image captioning dataset, and on the Flickr8k-CN Chinese image captioning dataset. Experimental results show that the algorithm effectively establishes visual-semantic associations and generates accurate, content-rich descriptions. Compared with current mainstream image captioning algorithms on Chinese sentences, it achieves substantial improvements of about 3%-30% on all evaluation metrics. To facilitate reproduction in follow-up research, the source code and models have been released on GitHub.
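The pipeline the abstract outlines, in which detector-derived region features are combined with mid-level Chinese attribute words and attended over by a recurrent caption decoder, can be sketched roughly as below. This is a minimal illustrative sketch in PyTorch, not the authors' released implementation: the class name AttributeAttentionDecoder, all dimensions, and the concatenation-based fusion of region and attribute features are assumptions made for illustration.

```python
# Minimal sketch (assumed design, not the paper's code): region features from
# an object detector are fused with embeddings of mid-level Chinese attribute
# words, and an additive-attention context drives an LSTM decoder that emits
# caption tokens under teacher forcing.
import torch
import torch.nn as nn


class AttributeAttentionDecoder(nn.Module):
    def __init__(self, num_attrs, vocab_size, feat_dim=2048, hid_dim=512, emb_dim=512):
        super().__init__()
        self.attr_emb = nn.Embedding(num_attrs, feat_dim)  # mid-level attribute vectors
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.att_proj = nn.Linear(feat_dim, hid_dim)
        self.h_proj = nn.Linear(hid_dim, hid_dim)
        self.att_score = nn.Linear(hid_dim, 1)
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def attend(self, feats, h):
        # Additive attention over the concatenated region + attribute features.
        e = self.att_score(torch.tanh(self.att_proj(feats) + self.h_proj(h).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)           # (B, N, 1) attention weights
        return (alpha * feats).sum(dim=1)         # (B, feat_dim) context vector

    def forward(self, region_feats, attr_ids, captions):
        # region_feats: (B, R, feat_dim) pooled features of detected boxes
        # attr_ids:     (B, A) indices of predicted Chinese attribute words
        # captions:     (B, T) ground-truth token ids (teacher forcing)
        feats = torch.cat([region_feats, self.attr_emb(attr_ids)], dim=1)
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(T):
            ctx = self.attend(feats, h)           # re-attend at every step
            step_in = torch.cat([self.word_emb(captions[:, t]), ctx], dim=-1)
            h, c = self.lstm(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)         # (B, T, vocab_size)


# Toy forward pass with random inputs, just to show the shapes.
model = AttributeAttentionDecoder(num_attrs=1000, vocab_size=8000)
regions = torch.randn(2, 5, 2048)                 # 5 detected regions per image
attrs = torch.randint(0, 1000, (2, 10))           # 10 attribute words per image
caps = torch.randint(0, 8000, (2, 12))            # caption token ids
print(model(regions, attrs, caps).shape)          # torch.Size([2, 12, 8000])
```

In this reading, the attribute embeddings play the role of the mid-level visual-semantic "composite attributes" bridging detector output and the language model; the paper's actual model extracts attribute context at multiple scales and levels, which this single fused feature set only gestures at.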

Key words

Chinese image captioning / object detection / attention mechanism

Cite this article

XIAO Yuhan, JIANG Aiwen, WANG Mingwen, JIE Anquan. Chinese Image Captioning Based on Middle-Level Visual-Semantic Composite Attributes. Journal of Chinese Information Processing, 2021, 35(4): 129-138.


Funding

National Natural Science Foundation of China (61966018, 61876074); Natural Science Foundation of Jiangxi Province (20181BAB202013); Science and Technology Project of the Education Department of Jiangxi Province (GJJ160277, GJJ150350)