Abstract
Based on a sequence-to-sequence neural network, this paper models multi-modal text summarization using both textual and visual semantic information. Specifically, a primary text encoder and a secondary gated encoder guided by image information encode the multi-modal semantics and align the semantic information of the text and the image. A multi-modal forward attention mechanism and a reverse attention mechanism then examine the aligned source text and image from complementary perspectives, yielding, for each modality, feature representations of the relevant (positively correlated) and the irrelevant semantic information. A forward filter removes irrelevant information from the forward attention output, and a reverse filter removes relevant information from the reverse attention output, so that textual and visual semantic information is selectively fused from both the forward and the reverse direction. Finally, on top of a pointer-generator network, the relevant information drives a forward pointer and the irrelevant information drives a reverse pointer, producing summaries compensated with multi-modal semantic information. On the JD Chinese e-commerce dataset, the proposed model achieves ROUGE-1, ROUGE-2, and ROUGE-L scores of 38.40, 16.71, and 28.01, respectively.
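To make the encoding and attention steps concrete, below is a minimal PyTorch sketch of two components the abstract describes: an image-guided secondary gated encoder and the paired forward/reverse attention. All names, tensor shapes, and the choice to build reverse attention from negated alignment scores are illustrative assumptions, not the authors' implementation.

    # Minimal sketch under the assumptions stated above (PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedSecondaryEncoder(nn.Module):
        """Re-encodes primary text states under image guidance via a sigmoid gate
        (an assumed realization of the 'secondary gated encoding' in the abstract)."""
        def __init__(self, d_text: int, d_img: int):
            super().__init__()
            self.img_proj = nn.Linear(d_img, d_text)   # project image features into text space
            self.gate = nn.Linear(2 * d_text, d_text)  # gate conditioned on both modalities

        def forward(self, text_states, img_feats):
            # text_states: (B, T, d_text); img_feats: (B, d_img)
            img = self.img_proj(img_feats).unsqueeze(1).expand_as(text_states)
            g = torch.sigmoid(self.gate(torch.cat([text_states, img], dim=-1)))
            return g * text_states  # selectively re-encoded, image-aligned text states

    def forward_reverse_attention(query, states, mask=None):
        """Forward attention weights relevant content; reverse attention, built here
        from negated scores, weights content the forward pass deems irrelevant."""
        scores = torch.einsum('bd,btd->bt', query, states)  # (B, T) alignment scores
        neg = -scores                                       # negated scores for reverse pass
        if mask is not None:                                # mask: bool (B, T), True = real token
            scores = scores.masked_fill(~mask, -1e9)        # exclude padding from both passes
            neg = neg.masked_fill(~mask, -1e9)
        fwd = F.softmax(scores, dim=-1)                     # positively correlated weights
        rev = F.softmax(neg, dim=-1)                        # weights on irrelevant content
        ctx_fwd = torch.einsum('bt,btd->bd', fwd, states)   # relevant context
        ctx_rev = torch.einsum('bt,btd->bd', rev, states)   # irrelevant context
        return ctx_fwd, ctx_rev

    # Shape-only usage example: B=2 samples, T=5 tokens, d_text=8, d_img=16.
    enc = GatedSecondaryEncoder(8, 16)
    states = enc(torch.randn(2, 5, 8), torch.randn(2, 16))
    ctx_fwd, ctx_rev = forward_reverse_attention(torch.randn(2, 8), states)

Under these assumptions, the forward and reverse filters would be further gates applied to ctx_fwd and ctx_rev, and the two filtered contexts would drive separate copy distributions, the forward and reverse pointers, in a pointer-generator decoder.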
Key words
multi-modal text summarization /
multi-modal alignment /
secondary gated encoding /
text-generation model
Funding
National Natural Science Foundation of China (U20B2061, 62102187); Natural Science Foundation of Jiangsu Province (Basic Research Program) (BK20210639)