Abstract: Sentiment analysis is a popular research topic in natural language processing, and multimodal sentiment analysis is a current challenge within this task. Existing studies fall short in capturing contextual information and in combining the information streams of different modalities. This paper proposes a novel multi-LSTMs Fusion Network (MLFN), which performs deep fusion across the text, audio, and visual modalities via an intra-modal feature extraction layer for each single modality and inter-modal fusion layers for dual-modal and tri-modal combinations. This hierarchical LSTM framework accounts for the features within each modality while capturing the interactions between modalities. Experimental results show that the proposed method integrates multimodal information more effectively and significantly improves the accuracy of multimodal emotion recognition.
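The hierarchical design described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, the use of final LSTM hidden states as modality representations, and the concatenation-based fusion at the dual- and tri-modal layers are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(d_in, d_hid):
    # Small random weights for the four LSTM gates (input, forget, output, cell).
    return {
        "W": rng.normal(0, 0.1, (4 * d_hid, d_in)),
        "U": rng.normal(0, 0.1, (4 * d_hid, d_hid)),
        "b": np.zeros(4 * d_hid),
    }

def lstm_encode(x, p):
    """Run a single-layer LSTM over x of shape (T, d_in); return the final hidden state."""
    d_hid = p["b"].shape[0] // 4
    h = np.zeros(d_hid)
    c = np.zeros(d_hid)
    for x_t in x:
        i, f, o, g = np.split(p["W"] @ x_t + p["U"] @ h + p["b"], 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

# Hypothetical sequence length and per-modality feature dimensions.
T, d_text, d_audio, d_video, d_hid = 20, 300, 74, 35, 32

# Intra-modal layer: one LSTM per modality extracts a fixed-size representation.
enc = {m: init_lstm(d, d_hid) for m, d in
       [("text", d_text), ("audio", d_audio), ("video", d_video)]}
x = {"text": rng.normal(size=(T, d_text)),
     "audio": rng.normal(size=(T, d_audio)),
     "video": rng.normal(size=(T, d_video))}
h = {m: lstm_encode(x[m], enc[m]) for m in x}

# Dual-modal layer: fuse each pair of single-modal representations.
pair_enc = init_lstm(2 * d_hid, d_hid)
pairs = [("text", "audio"), ("text", "video"), ("audio", "video")]
h_pair = [lstm_encode(np.concatenate([h[a], h[b]])[None, :], pair_enc)
          for a, b in pairs]

# Tri-modal layer: fuse the pairwise representations into one vector,
# then map it to a sentiment score with a linear head.
tri_enc = init_lstm(3 * d_hid, d_hid)
h_tri = lstm_encode(np.concatenate(h_pair)[None, :], tri_enc)
w_out = rng.normal(0, 0.1, d_hid)
score = float(w_out @ h_tri)
print(h_tri.shape, score)
```

In a trained model the weights would of course be learned end to end; the sketch only shows how information flows from single-modal encoders through dual-modal fusion to the tri-modal sentiment prediction.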