Abstract
GloVe is a widely used model for learning word vector representations. Many studies have found that word vectors of higher dimensionality perform better, but the higher the dimensionality, the longer the model takes to train. In GloVe, the time cost comes mainly from two stages: counting the word-word co-occurrence matrix and training the word vector representations. The proposed method builds two co-occurrence matrices over the corpus, one with symmetric windows and one with asymmetric windows, learns a lower-dimensional word vector representation from each with the original GloVe model, and then concatenates the two into a higher-dimensional representation. In terms of computational complexity this introduces no extra computation, while the co-occurrence counting and the training can both be carried out in parallel, which markedly improves efficiency. In experiments on large-scale corpora, 300-dimensional word vectors are learned from the symmetric-window and asymmetric-window co-occurrence matrices respectively and concatenated into 600-dimensional vectors. Compared with 600-dimensional vectors trained directly by GloVe with symmetric or asymmetric windows, the proposed vectors achieve significantly higher accuracy on the Chinese and English word analogy tasks and better clustering quality on the word clustering task, which verifies the effectiveness of the method.
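To make the concatenation step concrete, below is a minimal Python sketch, not the authors' released code. It assumes the two 300-dimensional vector sets have already been trained (possibly in parallel) on the symmetric-window and asymmetric-window co-occurrence matrices and saved in the usual GloVe text format ("word v1 v2 ... vd" per line); the file names are hypothetical.

```python
# Minimal sketch of the vector-concatenation step described in the abstract.
# Assumes two separately trained 300-d GloVe vector files; paths are hypothetical.
import numpy as np


def load_glove_vectors(path):
    """Read a GloVe-format text file into a {word: np.ndarray} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors


def concatenate_vectors(sym, asym):
    """Concatenate the two embeddings word by word over the shared vocabulary."""
    shared = sym.keys() & asym.keys()
    return {w: np.concatenate([sym[w], asym[w]]) for w in shared}


if __name__ == "__main__":
    sym_vecs = load_glove_vectors("vectors_symmetric_300d.txt")    # hypothetical path
    asym_vecs = load_glove_vectors("vectors_asymmetric_300d.txt")  # hypothetical path
    vecs_600d = concatenate_vectors(sym_vecs, asym_vecs)           # 300 + 300 = 600 dims
    print(len(vecs_600d), next(iter(vecs_600d.values())).shape)
```

Since the two GloVe runs share no state, they can be launched as independent jobs, and only this cheap concatenation pass is needed afterwards.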
Key words
GloVe model /
concatenated word vector /
word analogy task
Funding
National Natural Science Foundation of China (61806115)