Abstract: To learn distributed representations of text sequences, previous methods have focused on complex recurrent neural networks or supervised learning. In this paper, we propose a gated mean-max autoencoder for both Chinese and English text representations. In our model, we rely solely on the multi-head self-attention mechanism to construct the encoder and decoder. In the encoding stage, we propose a mean-max strategy that applies both mean and max pooling operations over the hidden vectors to capture diverse information from the input. To enable this information to steer the reconstruction process, the decoder employs an element-wise gate that dynamically selects between the mean and max representations. By training our model on large amounts of unlabelled Chinese and English data respectively, we obtain high-quality text encoders that are publicly available. Experimental results on reconstructing coherent long texts from the encoded representations demonstrate the superiority of our model over traditional recurrent neural networks, in terms of both performance and complexity.
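The following minimal PyTorch sketch illustrates the mean-max pooling and element-wise gating described above. The module name, the choice to condition the gate on a decoder state, and all dimensions are illustrative assumptions on our part, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MeanMaxGate(nn.Module):
    """Sketch of mean-max encoding with an element-wise gate.

    Assumption: the gate is conditioned on the concatenation of the
    mean representation, the max representation, and the current
    decoder state; the paper's exact parameterization may differ.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Maps [mean; max; decoder state] to a per-dimension gate value.
        self.gate = nn.Linear(3 * hidden_dim, hidden_dim)

    def encode(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, hidden_dim) from the self-attention encoder.
        mean_repr = hidden.mean(dim=1)       # mean pooling over positions
        max_repr, _ = hidden.max(dim=1)      # max pooling over positions
        return mean_repr, max_repr

    def combine(self, mean_repr, max_repr, dec_state):
        # Element-wise gate in (0, 1) dynamically mixes the two views.
        g = torch.sigmoid(
            self.gate(torch.cat([mean_repr, max_repr, dec_state], dim=-1))
        )
        return g * mean_repr + (1.0 - g) * max_repr


# Toy usage with random tensors, only to show the expected shapes.
if __name__ == "__main__":
    batch, seq_len, dim = 2, 10, 8
    layer = MeanMaxGate(dim)
    hidden = torch.randn(batch, seq_len, dim)
    mean_r, max_r = layer.encode(hidden)
    context = layer.combine(mean_r, max_r, torch.randn(batch, dim))
    print(context.shape)  # torch.Size([2, 8])
```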