Abstract: A bilingual topical word embedding model is proposed for the Chinese-Korean cross-lingual text classification task. The model combines a topic model with bilingual word embeddings to mitigate the loss of accuracy that polysemy-induced ambiguity causes in cross-lingual text classification. First, bilingual word embeddings are trained on a large-scale corpus of word-aligned parallel sentence pairs. Second, the classification dataset is represented with a topic model, yielding topic words in both languages. Finally, the embeddings of these topic words are fed into both a traditional text classifier and a deep learning text classifier. Experimental results show that the accuracy reaches 91.76% on the Chinese-Korean cross-lingual text classification task.
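The three-step pipeline in the abstract can be sketched on toy data as below. This is only an illustrative assumption of the approach, not the authors' implementation: the hand-built embedding table stands in for embeddings trained on word-aligned parallel sentences, frequency-based selection stands in for the topic model, and a nearest-centroid rule stands in for the traditional and deep learning classifiers.

```python
import numpy as np

# Step 1 (assumed): a shared bilingual embedding space in which aligned
# Chinese/Korean words lie close together. Hand-built here for illustration;
# in the paper these vectors come from word-aligned parallel sentence pairs.
EMB = {
    "经济": np.array([1.0, 0.1]), "경제": np.array([0.9, 0.2]),    # "economy"
    "体育": np.array([0.1, 1.0]), "스포츠": np.array([0.2, 0.9]),  # "sports"
}

def topic_words(doc_tokens, vocab, k=2):
    """Step 2 (stand-in): pick the k most frequent in-vocabulary tokens
    as 'topic words' instead of running a full topic model such as LDA."""
    counts = {}
    for t in doc_tokens:
        if t in vocab:
            counts[t] = counts.get(t, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:k]

def doc_vector(doc_tokens):
    """Step 3: represent the document by averaging its topic-word embeddings."""
    words = topic_words(doc_tokens, EMB)
    return np.mean([EMB[w] for w in words], axis=0)

def classify(doc_tokens, centroids):
    """Nearest-centroid stand-in for the text classifiers used in the paper."""
    v = doc_vector(doc_tokens)
    return max(centroids, key=lambda c: float(v @ centroids[c]))

# A Korean document is classified against label centroids defined in the
# same shared space, which is what makes the setup cross-lingual.
centroids = {"economy": np.array([1.0, 0.0]), "sports": np.array([0.0, 1.0])}
print(classify(["경제", "경제", "기타"], centroids))
```

Because both languages share one embedding space, a classifier fitted on Chinese documents can score Korean documents directly, which is the core idea the abstract describes.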