Image Caption via Pivot Language

ZHANG Kai, LI Junhui, ZHOU Guodong

Journal of Chinese Information Processing ›› 2019, Vol. 33 ›› Issue (3): 110-117.
Natural Language Processing


Abstract

Thanks to publicly available large-scale image datasets with manually labeled English captions, research on image caption generation has largely been confined to a single language (e.g., English). This paper explores zero-resource image captioning, generating Chinese captions with English as the pivot language. Specifically, drawing on recent advances in neural machine translation, we propose and compare two approaches. The first, a pipeline approach, first generates an English caption for a given image and then translates that caption into Chinese. The second, a pseudo-training-set approach, first translates the English captions in the training and development sets into Chinese, yielding a pseudo image-Chinese caption corpus, and then trains a model on that corpus to generate Chinese captions directly. For the second approach, we further compare word-based and character-based Chinese caption generation models. Experimental results show that the pseudo-training-set approach outperforms the pipeline approach, and that the character-based model outperforms the word-based one, reaching a BLEU_4 score of 0.341.
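The two approaches can be summarized in a few lines of code. Below is a minimal Python sketch of the pipeline approach; caption_en and translate_en_zh are hypothetical stand-ins for a trained English captioning model and a trained English-to-Chinese NMT model, not the paper's actual implementations.

def caption_en(image):
    """Hypothetical: an image captioning model (e.g., a CNN encoder with
    an attention-based RNN decoder) that returns an English caption."""
    raise NotImplementedError

def translate_en_zh(sentence):
    """Hypothetical: an NMT model that translates English into Chinese."""
    raise NotImplementedError

def pipeline_caption_zh(image):
    # Step 1: describe the image in the pivot language (English).
    en_caption = caption_en(image)
    # Step 2: translate the English caption into Chinese.
    return translate_en_zh(en_caption)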
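The pseudo-training-set approach instead applies translation offline, at corpus-construction time. A sketch under the same assumptions (translate_en_zh as above; MS-COCO-style data, where each image carries several English reference captions):

def build_pseudo_corpus(en_corpus):
    """en_corpus: iterable of (image, [english_caption, ...]) pairs.
    Returns (image, [chinese_caption, ...]) pairs, i.e., a pseudo
    image-Chinese caption corpus for training a captioning model."""
    zh_corpus = []
    for image, en_captions in en_corpus:
        zh_captions = [translate_en_zh(c) for c in en_captions]
        zh_corpus.append((image, zh_captions))
    return zh_corpus

Translation errors thus end up in the training data rather than at test time, and the resulting model generates Chinese directly, with no translation step at inference.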
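For the word- vs character-based comparison, the difference lies only in how a Chinese caption is tokenized before training and scoring. A toy illustration using NLTK's BLEU implementation (a stand-in for whatever scorer the authors used; the example sentences are invented, and smoothing is added only because the toy sentences are too short for raw 4-gram counts):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
reference  = "一个 男人 在 骑 自行车"   # word-segmented reference caption
hypothesis = "一个 男人 骑着 自行车"    # word-segmented system output

# Word-based: tokens are segmented words.
ref_words, hyp_words = reference.split(), hypothesis.split()
bleu4_word = sentence_bleu([ref_words], hyp_words, smoothing_function=smooth)

# Character-based: tokens are single Chinese characters, which avoids
# word-segmentation errors and shrinks the output vocabulary.
ref_chars = list(reference.replace(" ", ""))
hyp_chars = list(hypothesis.replace(" ", ""))
bleu4_char = sentence_bleu([ref_chars], hyp_chars, smoothing_function=smooth)

print(bleu4_word, bleu4_char)  # sentence_bleu defaults to BLEU_4 weights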


Key words

image caption / machine translation / neural network / pivot language

Cite this article

ZHANG Kai, LI Junhui, ZHOU Guodong. Image Caption via Pivot Language. Journal of Chinese Information Processing, 2019, 33(3): 110-117.


Funding

National Natural Science Foundation of China (61401295)
