Abstract: The attention-based encoder-decoder framework is widely used in image captioning. In previous methods, the single-directional attention mechanism does not check the consistency between the semantic information and the image content, which lowers the accuracy of the generated captions. To address this problem, this paper proposes an image captioning method based on a bi-directional attention mechanism. On top of the single-directional attention mechanism, an attention calculation from the image features to the semantic information is added, so that the image and the semantic information interact in both directions. A gated network is designed to fuse the information from these two directions. In contrast to previous studies, the attention module also uses historical semantic information to assist the generation of the current word. With two types of image features, the experimental results show that on the MSCOCO dataset the BLEU-4 score increases by 1.3 and the CIDEr score by 6.3 on average, and on Flickr30k the BLEU-4 score increases by 0.9 and the CIDEr score by 2.4 on average.
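Although the abstract only summarizes the architecture, the two attention directions and the gated fusion can be illustrated with a short sketch. The PyTorch-style code below is a minimal illustration, assuming region-level image features and a record of the decoder's historical hidden states; the module, layer, and tensor names (BiDirectionalAttention, img_feats, sem_history, etc.) are illustrative assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDirectionalAttention(nn.Module):
    """Sketch of bi-directional attention with gated fusion.

    Direction 1 (standard): the semantic state attends over image regions.
    Direction 2 (added):    the image features attend over the historical
                            semantic states.
    A gate then fuses the two context vectors.
    All layer names and sizes are assumptions for illustration only.
    """

    def __init__(self, img_dim, sem_dim, att_dim):
        super().__init__()
        # semantic -> image attention (single-directional baseline)
        self.img_proj = nn.Linear(img_dim, att_dim)
        self.sem_proj = nn.Linear(sem_dim, att_dim)
        self.v2i_score = nn.Linear(att_dim, 1)
        # image -> semantic attention (the added direction)
        self.hist_proj = nn.Linear(sem_dim, att_dim)
        self.i2v_score = nn.Linear(att_dim, 1)
        # gate that fuses the two context vectors
        self.gate = nn.Linear(img_dim + sem_dim, 1)
        self.fuse = nn.Linear(sem_dim, img_dim)  # map semantic context to the image-feature size

    def forward(self, img_feats, sem_state, sem_history):
        # img_feats:   (B, R, img_dim)  region-level image features
        # sem_state:   (B, sem_dim)     current decoder hidden state
        # sem_history: (B, T, sem_dim)  historical decoder hidden states

        # Direction 1: semantic state attends over image regions.
        s = self.v2i_score(torch.tanh(
            self.img_proj(img_feats) + self.sem_proj(sem_state).unsqueeze(1)))
        alpha = F.softmax(s, dim=1)                    # (B, R, 1)
        img_ctx = (alpha * img_feats).sum(dim=1)       # (B, img_dim)

        # Direction 2: a pooled image query attends over the semantic history.
        img_query = img_feats.mean(dim=1)              # (B, img_dim)
        t = self.i2v_score(torch.tanh(
            self.hist_proj(sem_history) + self.img_proj(img_query).unsqueeze(1)))
        beta = F.softmax(t, dim=1)                     # (B, T, 1)
        sem_ctx = (beta * sem_history).sum(dim=1)      # (B, sem_dim)

        # Gated fusion of the two directions.
        g = torch.sigmoid(self.gate(torch.cat([img_ctx, sem_ctx], dim=-1)))
        return g * img_ctx + (1.0 - g) * self.fuse(sem_ctx)
```

In this sketch the scalar gate g decides, at each decoding step, how much the fused context relies on the image-side direction versus the history-side direction, which mirrors the gated fusion of the two attention directions described in the abstract.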