1. Institute of Big Data and Language Education, Beijing Language and Culture University, Beijing 100083, China; 2. College of Information Science, Beijing Language and Culture University, Beijing 100083, China
Abstract: This paper presents a unified model for matrix-factorization-based word embeddings and applies it to Chinese-English cross-lingual word embeddings. It proposes a method for identifying cross-lingually related words in a parallel corpus. Cross-lingual word co-occurrence and cross-lingual pointwise mutual information each serve as a pointwise relevance measure in the objective function used to learn the cross-lingual word embeddings. Experiments compare different objective functions, corpora, and vector dimensions. On cross-lingual document classification, the best model reaches 87.04% accuracy, using cross-lingual word co-occurrence as the relevance measure. In contrast, models that use cross-lingual pointwise mutual information perform better on cross-lingual word similarity calculation. On English word similarity calculation, the experimental results show that our methods perform slightly better than English word embeddings trained with state-of-the-art methods.
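To illustrate the two relevance measures named in the abstract, the following minimal sketch counts cross-lingual word co-occurrence and derives pointwise mutual information from it. It rests on assumptions that are not taken from the paper: two words are treated as co-occurring whenever they appear in the same aligned sentence pair, and the function name `cross_lingual_pmi` and the toy data are hypothetical. The paper's own procedure for determining related words may differ.

```python
import math
from collections import Counter


def cross_lingual_pmi(parallel_pairs):
    """Return (cooc, pmi) for sentence-aligned (zh_tokens, en_tokens) pairs.

    Assumption (not from the paper): a Chinese word and an English word
    co-occur once for every aligned sentence pair that contains both.
    """
    cooc = Counter()      # (zh_word, en_word) -> number of pairs containing both
    zh_count = Counter()  # zh_word -> number of pairs containing it
    en_count = Counter()  # en_word -> number of pairs containing it
    n_pairs = 0

    for zh_tokens, en_tokens in parallel_pairs:
        n_pairs += 1
        zh_set, en_set = set(zh_tokens), set(en_tokens)
        zh_count.update(zh_set)
        en_count.update(en_set)
        for z in zh_set:
            for e in en_set:
                cooc[(z, e)] += 1

    # PMI(z, e) = log( p(z, e) / (p(z) * p(e)) ), with probabilities
    # estimated as relative frequencies over sentence pairs.
    pmi = {
        (z, e): math.log(c * n_pairs / (zh_count[z] * en_count[e]))
        for (z, e), c in cooc.items()
    }
    return cooc, pmi


# Toy parallel corpus (hypothetical data).
pairs = [
    (["猫", "睡觉"], ["cat", "sleeps"]),
    (["狗", "睡觉"], ["dog", "sleeps"]),
]
cooc, pmi = cross_lingual_pmi(pairs)
```

Either `cooc` or `pmi` could then supply the target values that a matrix factorization objective tries to reconstruct from the dot products of word vectors.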
Yu Dong (于东, born 1982), Ph.D., associate professor; main research area: natural language processing. E-mail: yudong_bluc@126.com
Zhao Yan (赵艳, born 1994), master's student; main research area: language information processing. E-mail: zhaoyan 0819@126.com
Wei Linxuan (韦林煊, born 1995), undergraduate student; main research area: language information processing. E-mail: 515984350@qq.com