多模态信息处理前沿综述:应用、融合和预训练

吴友政,李浩然,姚霆,何晓冬

中文信息学报 ›› 2022, Vol. 36 ›› Issue (5) : 1-20.
综述

A Survey of Multimodal Information Processing Frontiers: Application, Fusion and Pre-training

WU Youzheng, LI Haoran, YAO Ting, HE Xiaodong

摘要

随着视觉、听觉、语言等单模态人工智能技术的突破,如何让计算机拥有更接近人类的多模态信息理解能力,受到研究者们的广泛关注。另一方面,图文社交、短视频、视频会议、直播和虚拟数字人等应用的涌现,对多模态信息处理技术提出了更高的要求,同时也为多模态研究提供了海量的数据和丰富的应用场景。该文首先介绍近期自然语言处理领域关注度较高的多模态应用,随后从单模态的特征表示、多模态特征融合的阶段、融合模型的网络结构、未对齐模态和模态缺失情形下的多模态融合等角度综述主流的多模态融合方法,最后综合分析视觉-语言跨模态预训练模型的最新进展。

Abstract

Over the past decade, a steady stream of innovations and breakthroughs has convincingly pushed the limits of modeling individual modalities such as vision, speech and language. Going beyond this progress in single modalities, the rise of multimodal social networks, short-video applications, video conferencing, live streaming and digital humans demands the development of multimodal intelligence and offers fertile ground for multimodal analysis. This paper reviews recent multimodal applications that have attracted intensive attention in the field of natural language processing, and summarizes mainstream multimodal fusion approaches from the perspectives of single-modality representation, the stage at which fusion is performed, the network structure of the fusion model, fusion of unaligned modalities, and fusion with missing modalities. In addition, it elaborates on the latest progress in vision-language pre-training.
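To make the fusion perspective above concrete, the sketch below shows one common pattern in the multimodal-fusion literature: text tokens attending over image-region features through cross-attention, with the attended visual context added back to the text stream. It is a minimal, hypothetical PyTorch illustration, not code from this survey or from any model it cites; the class name, dimensions and layer choices are assumptions made only for illustration.

```python
# Purely illustrative sketch (not code from this survey or any cited model):
# a minimal cross-attention fusion block in which text tokens attend over
# image-region features. All names and dimensions are assumptions.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, hidden_dim: int = 768, image_dim: int = 2048, num_heads: int = 8):
        super().__init__()
        # Project region features (e.g., detector outputs) into the text space.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, num_tokens, hidden_dim), e.g., BERT token states
        # image_feats: (batch, num_regions, image_dim), e.g., detected regions
        img = self.image_proj(image_feats)
        # Text queries attend over visual keys/values; the attended visual
        # context is added back to the text stream as a residual connection.
        attended, _ = self.cross_attn(query=text_feats, key=img, value=img)
        return self.norm(text_feats + attended)


if __name__ == "__main__":
    fusion = CrossModalFusion()
    text = torch.randn(2, 20, 768)    # toy batch: 2 sentences, 20 tokens each
    image = torch.randn(2, 36, 2048)  # toy batch: 36 regions per image
    print(fusion(text, image).shape)  # torch.Size([2, 20, 768])
```

Early-, late- and intermediate-fusion variants differ mainly in where such an interaction block sits in the network and in how many of these layers are stacked.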

关键词

多模态信息处理 / 多模态融合 / 多模态预训练 / 自然语言处理

Key words

multimodal information processing / multimodal fusion / multimodal pre-training / natural language processing

引用本文

吴友政,李浩然,姚霆,何晓冬. 多模态信息处理前沿综述:应用、融合和预训练. 中文信息学报. 2022, 36(5): 1-20
WU Youzheng, LI Haoran, YAO Ting, HE Xiaodong. A Survey of Multimodal Information Processing Frontiers: Application, Fusion and Pre-training. Journal of Chinese Information Processing. 2022, 36(5): 1-20

参考文献

[1] Morency L P, Baltrusaitis T. Tutorial on multimodal machine learning[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.
[2] Zhang C, Yang Z, He X, et al. Multimodal intelligence: representation learning, information fusion and applications[J]. IEEE Journal of Selected Topics in Signal Processing, 2020, 14(3): 478-493.
[3] Summaira J, Li X, Shoib A M, et al. Recent advances and trends in multimodal deep learning: a review[J]. arXiv preprint arXiv:2105.11087, 2021.
[4] Zhang S, Song L, Jin L, et al. Video-aided unsupervised grammar induction[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021: 1513-1524.
[5] Busso C, Bulut M, Lee C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation,2008,42(4): 335-359.
[6] Zadeh A, Zellers R, Pincus E, et al. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages[J]. IEEE Intelligent Systems, 2016, 31(6):82-88.
[7] Zadeh A B, Liang P P, Poria S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 2236-2246.
[8] Yu W, Xu H, Meng F, et al. CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotations of modality[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 3718-3727.
[9] Jia J, Zhou S, Yin Y, et al. Inferring emotions from large-scale internet voice data[J]. IEEE Transactions on Multimedia, 2019, 21(7): 1853-1866.
[10] Cai Y, Cai H, Wan X. Multi-modal sarcasm detection in Twitter with hierarchical fusion model[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 2506-2515.
[11] Truong Q T, Lauw H W. VistaNet: visual aspect attention network for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019: 305-312.
[12] Lin T, Maire M, Belongie S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the European Conference on Computer Vision, 2014: 740-755.
[13] Sharma P, Ding N, Goodman S, et al. Conceptual Captions: a cleaned, hypernymed, image ALT-text dataset for automatic image captioning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 2556-2565.
[14] Young P, Lai A, Hodosh M, et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2014(2): 67-78.
[15] Krishna R, Zhu Y, Groth O, et al. Visual Genome: connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1):32-73.
[16] Ordonez V, Kulkarni G, Berg T L. Im2Text: describing images using 1 million captioned photographs[C]//Proceedings of the 24th International Conference on Neural Information Processing Systems, 2011: 1143-1151.
[17] Xu J, Mei T, Yao T, et al. MSR-VTT: a large video description dataset for bridging video and language[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016: 5288-5296.
[18] Krishna R, Hata K, Ren F, et al. Dense-captioning events in videos[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017.
[19] Zhou L, Xu C, Corso J J. Towards automatic learning of procedures from web instructional videos[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018: 7590-7598.
[20] Pan Y, Li Y, Luo J, et al. Auto-captions on GIF: a large-scale video-sentence dataset for vision-language pre-training[J]. arXiv preprint arXiv:2007.02375,2020.
[21] Huang T H, Ferraro F, Mostafazadeh N, et al. Visual storytelling[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016: 1233-1239.
[22] Agrawal A, Lu J, Antol S, et al. VQA: visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision, 2015: 2425-2433.
[23] Goyal Y, Khot T, Summers-Stay D, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6325-6334.
[24] Johnson J, Hariharan B, Maaten L, et al. CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017: 1988-1997.
[25] Hudson D A, Manning C D. GQA: a new dataset for real-world visual reasoning and compositional question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019: 6700-6709.
[26] Zhou Y, Ji R, Su J, et al. More than an answer: neural pivot network for visual question answering[C]//Proceedings of the 25th ACM International Conference on Multimedia, 2017: 681-689.
[27] Zhou Y, Ji R, Su J, et al. Dynamic capsule attention for visual question answering[C]// Proceedings of the AAAI Conference on Artificial Intelligence, 2019: 9324-9331.
[28] Das A, Kottur S, Gupta K, et al. Visual dialog[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017: 1080-1089.
[29] Mostafazadeh N, Brockett C, Dolan B, et al. Image-grounded conversations: multimodal context for natural question and response generation[C]//Proceedings of the International Joint Conference on Natural Language Processing, 2017: 462-472.
[30] Vries H D, Strub F, Chandar S, et al. GuessWhat?! Visual object discovery through multi-modal dialogue[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4466-4475.
[31] Shuster K, Humeau S, Bordes A, et al. Image-Chat: engaging grounded conversations[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 2414-2429.
[32] Alamri H, Cartillier V, Das A, et al. Audio-visual scene-aware dialog[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019: 7558-7567.
[33] Saha A, Khapra M, Sankaranarayanan K. Towards building large scale multimodal domain-aware conversation systems[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018: 696-704.
[34] Moon S, Kottur S, Crook P A, et al. Situated and interactive multimodal conversations[C]//Proceedings of the 28th International Conference on Computational Linguistics, 2020: 1103-1121.
[35] Zhao N, Li H, Wu Y, et al. The JDDC 2.0 Corpus: a large-scale multimodal multi-turn Chinese dialogue dataset for e-commerce customer service[J]. arXiv preprint arXiv:2109.12913,2021.
[36] Li M, Zhang L, Ji H, et al. Keep meeting summaries on topic: abstractive multi-modal meeting summarization[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 2190-2196.
[37] Palaskar S, Libovicky J, Gella S, et al. Multimodal abstractive summarization for How2 videos[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 6587-6596.
[38] Li H, Zhu J, Ma C, et al. Multi-modal summarization for asynchronous collection of text, image, audio and video[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2017: 1092-1102.
[39] Li H, Zhu J, Liu T, et al. Multi-modal sentence summarization with modality attention and image filtering[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018: 4152-4158.
[40] Li H, Zhu J, Zhang J, et al. Multimodal sentence summarization via multimodal selective encoding[C]//Proceedings of the 28th International Conference on Computational Linguistics, 2020: 5655-5667.
[41] Zhu J, Li H, Liu T, et al. MSMO: multimodal summarization with multimodal output[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018: 4154-4164.
[42] Zhu J, Zhou Y, Zhang J, et al. Multimodal summarization with guidance of multimodal reference[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 9749-9756.
[43] Li H, Yuan P, Xu S, et al. Aspect-aware multimodal summarization for Chinese e-commerce products[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 8188-8195.
[44] Carletta J, Ashby S, Bourban S, et al. The AMI meeting corpus: a pre-announcement[C]//Proceedings of the 2nd International Workshop on Machine Learning for Multimodal Interaction, 2005: 28-39.
[45] Sanabria R, Caglayan O, Palaskar S, et al. How2: a large-scale dataset for multimodal language understanding[C]//Proceedings of the 32nd Conference on Neural Information Processing Systems, 2018: 1-12.
[46] Hessel J, Lee L, Mimno D. Unsupervised discovery of multimodal links in multi-image, multi-sentence documents[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019: 2034-2045.
[47] Suhr A, Zhou S, Zhang A, et al. A corpus for reasoning about natural language grounded in photographs[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 6418-6428.
[48] Tan H, Bansal M. Vokenization: improving language understanding with contextualized, visual-grounded supervision[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020: 2066-2080.
[49] Plummer B A, Wang L, Cervantes C M, et al. Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models[C]//Proceedings of the IEEE International Conference on Computer Vision, 2015: 2641-2649.
[50] Wang Q, Tan H, Shen S, et al. MAF: multimodal alignment framework for weakly-supervised phrase grounding[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020: 2030-2038.
[51] Zhang H, Sun A, Jing W, et al. Parallel attention network with sequence matching for video grounding[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021: 776-790.
[52] Gao J, Sun C, Yang Z, et al. TALL: temporal activity localization via language query[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 5277-5285.
[53] Krishna R, Hata K, Ren F, et al. Dense-captioning events in videos[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 706-715.
[54] Regneri M, Rohrbach M, Wetzel D, et al. Grounding action descriptions in videos[J]. Transactions of the Association for Computational Linguistics, 2013(1): 25-36.
[55] Sigurdsson G A, Varol G, Wang X, et al. Hollywood in homes: crowdsourcing data collection for activity understanding[C]//Proceedings of the European Conference on Computer Vision, 2016: 510-526.
[56] Elliott D, Frank S, Hasler E. Multi-language image description with neural sequence models[J]. arXiv preprint arXiv:1510.04709,2015.
[57] Elliott D, Frank S, Simaan K, et al. Multi30k: multilingual English-German image descriptions[C]//Proceedings of the 5th Workshop on Vision and Language, 2016: 70-74.
[58] Huang P, Liu F, Shiang S R, et al. Attention-based multimodal neural machine translation[C]//Proceedings of the 1st Conference on Machine Translation, 2016: 639-645.
[59] Calixto I, Liu Q. Incorporating global visual features into attention-based neural machine translation[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2017: 992-1003.
[60] Elliott D, Kadar A. Imagination improves multimodal translation[C]//Proceedings of the International Joint Conference on Natural Language Processing, 2017: 130-141.
[61] Zhou M, Cheng R, Lee Y J, et al. A visual attention grounding neural model for multimodal machine translation[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018: 3643-3653.
[62] Ive J, Madhyastha P, Specia L. Distilling translations with visual awareness[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 6525-6538.
[63] Yao S, Wan X. Multimodal transformer for multimodal machine translation[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 4346-4350.
[64] Lin H, Meng F, Su J, et al. Dynamic context-guided capsule network for multimodal machine translation[C]//Proceedings of the 28th ACM International Conference on Multimedia, 2020: 1320-1329.
[65] Wu Z, Kong L, Bi W, et al. Good for misconceived reasons: an empirical revisiting on the need for visual context in multimodal machine translation[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021: 6153-6166.
[66] Moon S, Neves L, Carvalho V. Multimodal named entity recognition for short social media posts[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018: 852-860.
[67] Zhang Q, Fu J, Liu X, et al. Adaptive coattention network for named entity recognition in Tweets[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018: 5674-5681.
[68] Lu D, Neves L, Carvalho V, et al. Visual attention model for name tagging in multimodal social media[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 1990-1999.
[69] Yu J, Jiang J, Yang L, et al. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 3342-3352.
[70] Sui D, Tian Z, Chen Y, et al. A large-scale Chinese multimodal NER dataset with speech clues[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021: 2807-2818.
[71] Logan IV R L, Humeau S, Singh S. Multimodal attribute extraction[C]//Proceedings of the 31st Conference on Neural Information Processing Systems, 2017: 1-7.
[72] Zhu T, Wang Y, Li H, et al. Multimodal joint attribute prediction and value extraction for e-commerce product[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020: 2129-2139.
[73] Johnson J, Krishna R, Stark M, et al. Image retrieval using scene graphs[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015: 3668-3678.
[74] Lu C, Krishna R, Bernstein M, et al. Visual relationship detection with language priors[C]//Proceedings of the European Conference on Computer Vision, 2016: 852-869.
[75] Yao T, Pan Y, Li Y, et al. Exploring visual relationship for image captioning[C]//Proceedings of the European Conference on Computer Vision, 2018: 711-727.
[76] Yao T, Pan Y, Li Y, et al. Hierarchy parsing for image captioning[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 2621-2629.
[77] Yang Z, He X, Gao J, et al. Stacked attention networks for image question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016: 21-29.
[78] Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018: 6077-6086.
[79] Li L, Gan Z, Cheng Y, et al. Relation-aware graph attention network for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 10312-10321.
[80] Shrestha R, Kafle K, Kanan C. Answer them all! toward universal visual question answering models[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019: 10472-10481.
[81] Huang Q, Wei J, Cai Y, et al. Aligned dual channel graph convolutional network for visual question answering[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 7166-7176.
[82] Yu F, Tang J, Yin W, et al. ERNIE-ViL: knowledge enhanced vision-language representations through scene graphs[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 3208-3216.
[83] Degottex G, Kane J, Drugman T, et al. COVAREP: a collaborative voice analysis repository for speech technologies[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014: 960-964.
[84] Baevski A, Zhou Y, Mohamed A, et al. Wav2vec 2.0: a framework for self-supervised learning of speech representations[C]//Proceedings of the Advances in Neural Information Processing Systems, 2020.
[85] Liu J, Zhu X, Liu F, et al. OPT: omni-perception pre-trainer for cross-modal understanding and generation[J]. arXiv preprint arXiv:2107.00249, 2021.
[86] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[87] Nguyen D K, Okatani T. Improved fusion of visual and language representations by dense symmetric Co-attention for visual question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018: 6087-6096.
[88] Ren S, He K, Girshick R B, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[89] Khademi M. Multimodal neural graph memory networks for visual question answering[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 7177-7188.
[90] Alberti C, Ling J, Collins M, et al. Fusion of detected objects in text for visual question answering[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019: 2131-2140.
[91] Tan H, Bansal M. LXMERT: learning cross-modality encoder representations from transformers[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019: 5099-5110.
[92] Sun S, Chen Y, Li L, et al. LightningDOT: pre-training visual-semantic embeddings for real-time image-text retrieval[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021: 982-997.
[93] Xu H, Yan M, Li C, et al. E2E-VLP: end-to-end vision-language pre-training enhanced by visual learning[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021: 503-513.
[94] Sun C, Myers A, Vondrick C, et al. VideoBERT: a joint model for video and language representation learning[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 7463-7472.
[95] Li G, Duan N, Fang Y, et al. Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 11336-11344.
[96] Li X, Yin X, Li C, et al. Oscar: object-semantics aligned pre-training for vision-language tasks[C]//Proceedings of the European Conference on Computer Vision, 2020: 121-137.
[97] Su W, Zhu X, Cao Y, et al. VL-BERT: pre-training of generic visual-linguistic representations[C]//Proceedings of the 8th International Conference on Learning Representations, 2020.
[98] Huang H, Su L, Qi D, et al. M3P: learning universal representations via multitask multilingual multimodal pre-training[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2021: 3977-3986.
[99] Zhu L, Yang Y. ActBERT: learning global-local video-text representations[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020: 8743-8752.
[100] Lu J, Batra D, Parikh D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]//Proceedings of the 33rd Conference on Neural Information Processing Systems, 2019: 13-23.
[101] Devlin J, Chang M, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019: 4171-4186.
[102] Brown T B, Mann B, Ryder N, et al. Language models are few-shot learners[C]//Proceedings of the 34th Conference on Neural Information Processing Systems, 2020.
[103] Rahman W, Hasan M K, Lee S, et al. Integrating multimodal information in large pretrained transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 2359-2369.
[104] Yang Z, Dai Z, Yang Y, et al. XLNet: generalized autoregressive pretraining for language understanding[C]//Proceedings of the 33rd Conference on Neural Information Processing Systems, 2019: 5754-5764.
[105] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[C]//Proceedings of the 3rd International Conference on Learning Representations, 2015.
[106] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 5998-6008.
[107] Lu J, Yang J, Batra D, et al. Hierarchical question-image co-attention for visual question answering[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016: 289-297.
[108] Nam H, Ha J, Kim J. Dual attention networks for multimodal reasoning and matching[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2156-2164.
[109] Yu Z, Yu J, Cui Y, et al. Deep modular co-attention networks for visual question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019: 6281-6290.
[110] Tsai Y H, Bai S, Liang P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 6558-6569.
[111] Sahay S, Okur E, Kumar S H, et al. Low rank fusion based transformers for multimodal sequences[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 29-34.
[112] Yin Y, Meng F, Su J, et al. A novel graph-based multi-modal fusion encoder for neural machine translation[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 3025-3035.
[113] Hu J, Liu Y, Zhao J, et al. MMGCN: multimodal fusion via deep graph convolution network for emotion recognition in conversation[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021: 5666-5675.
[114] Kim J H, Jun J, Zhang B T. Bilinear attention networks[C]//Proceedings of the 32nd Conference on Neural Information Processing Systems, 2018: 1571-1581.
[115] Yang J, Wang Y, Yi R, et al. MTAG: Modal-temporal attention graph for unaligned human multimodal language sequences[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021: 1009-1021.
[116] Zhao J, Li R, Jin Q. Missing modality imagination network for emotion recognition with uncertain missing modalities[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021: 2608-2618.
[117] Lewis M, Liu Y, Goyal N, et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 7871-7880.
[118] Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. arXiv preprint arXiv:1910.10683, 2019.
[119] Huang P Y, Patrick M, Hu J, et al. Multilingual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021: 2443-2459.
[120] Xu H, Ghosh G, Huang P Y, et al. VLM: task-agnostic video-language model pre-training for video understanding[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021: 4227-4239.
[121] Li Y, Pan Y, Yao T, et al. Scheduled sampling in vision-language pretraining with decoupled encoder-decoder network[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021: 8518-8526.
[122] Li W, Gao C, Niu G, et al. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021: 2592-2607.
[123] Pires T, Schlinger E, Garrette D. How multilingual is multilingual BERT?[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 4996-5001.
[124] Conneau A, Wu S, Li H, et al. Emerging cross-lingual structure in pretrained language models[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 6022-6034.
[125] Li L H, You H, Wang Z, et al. Unsupervised vision-and-language pre-training without parallel images and captions[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021: 5339-5350.
[126] Wang M, Qi G, Wang H, et al. Richpedia: a comprehensive multi-modal knowledge graph[C]// Proceedings of the Joint International Semantic Technology Conference. Springer, Cham, 2019: 130-145.
[127] 郑秋硕,漆桂林,王萌. 多模态知识图谱[EB/OL].https://zhuanlan.zhihu.com/p/163278672.[2020-07-26].

基金

科技创新2030-“新一代人工智能”重大项目(2020AAA0108600)