Chinese-English Cross-lingual Word Embeddings Based on Pointwise Relevant Measurement Matrix Factorization

Journal of Chinese Information Processing, 2017, Vol. 31, Issue 1: 58-65.

Natural Language Processing Applications

YU Dong¹,², ZHAO Yan², WEI Linxuan², XUN Endong¹,²

Abstract

This paper studies word embedding methods based on matrix factorization, proposes a unified model that describes them, and applies the model to Chinese-English cross-lingual word embeddings. Using a parallel (aligned bilingual) corpus as the knowledge source, it proposes a method for determining cross-lingual relevant words and two pointwise relevance measures: cross-lingual co-occurrence counts and cross-lingual pointwise mutual information (PMI). An objective function is designed for each measure to learn the cross-lingual embeddings. Experiments vary the objective function, the training corpus, and the vector dimension. In Chinese-English cross-lingual document classification, the best model uses cross-lingual co-occurrence counts as the relevance measure and reaches 87.04% accuracy. In Chinese-English cross-lingual word similarity, models that use cross-lingual PMI perform better. On English-English word similarity, the proposed embeddings perform slightly better than English word embeddings trained with mainstream methods.
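The two relevance measures named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: every Chinese word is counted once against every English word in the same aligned sentence pair (the paper's own relevant-word method may be more selective), PMI uses the standard definition, and the function names and toy corpus are hypothetical. The factorization step and the paper's objective functions are not shown.

```python
import math
from collections import Counter


def cross_lingual_counts(pairs):
    """Count (zh_word, en_word) co-occurrences over aligned sentence pairs.

    Assumption: a Chinese word co-occurs once with each distinct English
    word in its aligned sentence. This is a simplification, not the
    paper's relevant-word computation.
    """
    counts = Counter()
    for zh_sent, en_sent in pairs:
        for zh in set(zh_sent):
            for en in set(en_sent):
                counts[(zh, en)] += 1
    return counts


def cross_lingual_pmi(counts):
    """Standard PMI: log(N * c(zh, en) / (c(zh) * c(en))) for observed pairs."""
    total = sum(counts.values())
    zh_total = Counter()
    en_total = Counter()
    for (zh, en), c in counts.items():
        zh_total[zh] += c
        en_total[en] += c
    return {
        (zh, en): math.log(c * total / (zh_total[zh] * en_total[en]))
        for (zh, en), c in counts.items()
    }


# Toy aligned corpus (hypothetical data, for illustration only).
pairs = [
    (["猫", "喝", "水"], ["cat", "drinks", "water"]),
    (["狗", "喝", "水"], ["dog", "drinks", "water"]),
    (["猫", "睡觉"], ["cat", "sleeps"]),
]
counts = cross_lingual_counts(pairs)
pmi = cross_lingual_pmi(counts)
```

In this toy corpus the pair (猫, cat) receives a higher PMI than (猫, water). Raw counts treat frequent words as strongly related to everything they appear with, while PMI discounts that frequency, which is one plausible reason the two measures behave differently across tasks.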

Cite this article

YU Dong, ZHAO Yan, WEI Linxuan, XUN Endong. Chinese-English Cross-lingual Word Embeddings Based on Pointwise Relevant Measurement Matrix Factorization. Journal of Chinese Information Processing. 2017, 31(1): 58-65
