Abstract

Bilingual word embeddings are usually obtained by mapping the source-language embedding space into the target-language space, using the linear transformation that minimizes the distance between the mapped source embeddings and their target counterparts. However, large parallel corpora are hard to obtain, which limits the accuracy of the resulting embeddings. To address cross-lingual word embedding under unbalanced corpora and scarce bilingual data, this paper proposes a cross-lingual word embedding method based on a small dictionary and unbalanced monolingual corpora. The method first normalizes the monolingual word vectors and solves the optimal orthogonal linear transformation over the small-dictionary word pairs to obtain an initial value for gradient descent. It then clusters the large source-language (English) corpus, uses the small dictionary to find the source-language words corresponding to each cluster, and takes the mean word vector of each cluster together with the mean of the corresponding source-to-target translation vectors to establish new bilingual word-vector correspondences, which are added to the small dictionary so that the dictionary is generalized and extended. Finally, the extended dictionary is used to run gradient descent on the cross-lingual word embedding mapping model to obtain the optimal value. Experiments on English-Italian, English-German, and English-Finnish show that the method reduces the number of gradient descent iterations and the training time while maintaining good accuracy in cross-lingual word embedding.
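The initialization step admits a closed-form solution. Below is a minimal Python sketch (not the authors' released code; function names and dimensions are illustrative) of the two operations the abstract describes first: length-normalizing the monolingual word vectors and solving the orthogonal Procrustes problem over the seed-dictionary pairs, whose SVD solution serves as the starting point for gradient descent.

```python
import numpy as np

def normalize_rows(E):
    """Length-normalize each word vector (row) to unit L2 norm."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return E / norms

def procrustes_init(X, Y):
    """Closed-form orthogonal W minimizing ||X @ W - Y||_F.

    X: (n, d) source vectors of the n seed-dictionary pairs.
    Y: (n, d) target vectors aligned row by row with X.
    """
    # SVD of the cross-covariance X^T Y; the optimal rotation is U @ Vt.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage with random stand-in data; real inputs would be pretrained
# monolingual embeddings indexed through the small seed dictionary.
rng = np.random.default_rng(0)
X = normalize_rows(rng.standard_normal((500, 300)))  # source side (e.g., English)
Y = normalize_rows(rng.standard_normal((500, 300)))  # target side (e.g., Italian)
W0 = procrustes_init(X, Y)  # initial value for the later gradient descent
```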
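The dictionary-expansion step can be sketched the same way. Assuming scikit-learn's KMeans, the hypothetical expand_dictionary below follows the pairing rule stated in the abstract: cluster the source embeddings, locate the seed-dictionary words falling in each cluster, and pair the cluster's mean vector with the mean of those words' target-side translation vectors. The cluster count k and all parameter names are assumptions, not the paper's settings.

```python
from sklearn.cluster import KMeans
import numpy as np

def expand_dictionary(src_emb, src_vocab, seed_dict, tgt_emb, tgt_vocab, k=100):
    """Return extra (source_vec, target_vec) pairs built from cluster means.

    src_emb: (V_s, d) normalized source embeddings; src_vocab: list of words.
    seed_dict: the small dictionary, mapping source word -> target word.
    tgt_emb / tgt_vocab: target-side embeddings and vocabulary.
    """
    tgt_index = {w: i for i, w in enumerate(tgt_vocab)}
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(src_emb)
    new_X, new_Y = [], []
    for c in range(k):
        members = np.where(labels == c)[0]
        # Seed words inside this cluster that the small dictionary can translate.
        translated = [i for i in members
                      if src_vocab[i] in seed_dict
                      and seed_dict[src_vocab[i]] in tgt_index]
        if not translated:
            continue
        src_mean = src_emb[members].mean(axis=0)           # cluster mean vector
        tgt_mean = np.mean([tgt_emb[tgt_index[seed_dict[src_vocab[i]]]]
                            for i in translated], axis=0)  # translations' mean
        new_X.append(src_mean)
        new_Y.append(tgt_mean)
    return np.array(new_X), np.array(new_Y)
```

Each returned pair acts as one synthetic dictionary entry, so clusters containing no translatable seed word contribute nothing to the extended dictionary.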
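For the final refinement, a plain gradient-descent loop illustrates the idea, under the assumption (not stated explicitly in the abstract) that the mapping model minimizes the squared error ||XW - Y||_F^2 over the extended dictionary; the learning rate and iteration count are placeholders.

```python
import numpy as np

def refine_mapping(X, Y, W0, lr=0.1, iters=200):
    """Plain gradient descent on f(W) = ||X @ W - Y||_F**2 / n, starting from W0."""
    W = W0.copy()
    n = X.shape[0]
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ W - Y) / n  # gradient of the mean squared error
        W -= lr * grad
    return W
```

In this sketch the seed pairs and the cluster-mean pairs from expand_dictionary would be stacked (e.g., with np.vstack) into X and Y before calling refine_mapping, which is how the extended dictionary drives the descent from the Procrustes starting point W0.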
Key words: small dictionary; unbalanced corpus; word embedding; k-means clustering; gradient descent
Funding
National Natural Science Foundation of China (61462054, 61732005, 61672271); Yunnan Provincial Science and Technology Department Project (2015FB135)