Abstract
Sentiment analysis has long been a research focus in natural language processing, and multimodal sentiment analysis is a current challenge in this field. Existing studies fall short in capturing contextual information and in modeling the interactions between the time-series information of different modalities. This paper proposes a novel Multi-LSTMs Fusion Network (MLFN), which uses hierarchical LSTMs, with an intra-modal feature-extraction layer for each single modality followed by bimodal and trimodal inter-modal fusion layers, to perform deep fusion across the text, audio, and visual modalities. This hierarchical framework accounts for the feature information within each modality while deeply capturing the interactions between modalities. Experimental results show that the proposed model integrates multimodal information effectively and substantially improves the accuracy of multimodal emotion recognition.
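The abstract describes the MLFN architecture only at a high level: per-modality LSTMs for intra-modal feature extraction, followed by bimodal and then trimodal LSTM fusion layers. Below is a minimal PyTorch sketch of that layered structure; the class name, hidden sizes, concatenation-based fusion, and last-time-step classification head are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MLFNSketch(nn.Module):
    """Hypothetical sketch of a hierarchical multi-LSTM fusion model."""

    def __init__(self, d_text, d_audio, d_visual, hidden=64, n_classes=2):
        super().__init__()
        # Layer 1: intra-modal feature extraction, one LSTM per modality
        self.text_lstm = nn.LSTM(d_text, hidden, batch_first=True)
        self.audio_lstm = nn.LSTM(d_audio, hidden, batch_first=True)
        self.visual_lstm = nn.LSTM(d_visual, hidden, batch_first=True)
        # Layer 2: bimodal fusion LSTMs over concatenated pairs of streams
        self.ta_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.tv_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.av_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        # Layer 3: trimodal fusion LSTM over all three bimodal streams
        self.tri_lstm = nn.LSTM(3 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, text, audio, visual):
        # Inputs: (batch, seq_len, feature_dim), time-aligned across modalities
        t, _ = self.text_lstm(text)
        a, _ = self.audio_lstm(audio)
        v, _ = self.visual_lstm(visual)
        # Bimodal layers see pairwise concatenations of unimodal hidden states
        ta, _ = self.ta_lstm(torch.cat([t, a], dim=-1))
        tv, _ = self.tv_lstm(torch.cat([t, v], dim=-1))
        av, _ = self.av_lstm(torch.cat([a, v], dim=-1))
        # Trimodal layer fuses the three bimodal streams
        tri, _ = self.tri_lstm(torch.cat([ta, tv, av], dim=-1))
        # Classify from the final hidden state
        return self.classifier(tri[:, -1, :])

# Example with made-up feature sizes (e.g., word embeddings, acoustic, facial)
model = MLFNSketch(d_text=300, d_audio=74, d_visual=35)
logits = model(torch.randn(8, 20, 300), torch.randn(8, 20, 74), torch.randn(8, 20, 35))
```

Under these assumptions, pairwise concatenation of unimodal hidden states lets each bimodal LSTM model interactions between two modalities before the trimodal layer fuses all three streams; the paper's actual fusion operators, sequence alignment, and output head may differ.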
Keywords: multi-modal / emotion analysis / LSTM
Funding
National Natural Science Foundation of China (62006166, 61976146, 62076176); China Postdoctoral Science Foundation (2019M661930); a project of the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD)