Journal of Chinese Information Processing (中文信息学报) ›› 2019, Vol. 33 ›› Issue (10): 18-30.
Knowledge Representation and Knowledge Acquisition

基于多源信息融合的分布式词表示学习

  • 冶忠林1,2,3,4,赵海兴1,2,3,4,张科1,3,4,朱宇1,3,4

Distributed Word Embedding via Multi-Source Information Fusion

  • YE Zhonglin1,2,3,4, ZHAO Haixing1,2,3,4, ZHANG Ke1,3,4, ZHU Yu1,3,4

Abstract

Distributed word representation learning aims to train low-dimensional, compressed, and dense word vectors within a neural network framework. However, such neural-network-based word representation models have three shortcomings: (1) rare words lack sufficient context in the training data, so their learned vectors cannot adequately reflect their semantics in the corpus; (2) when an antonym of the center word appears in its context, words with completely opposite meanings are assigned closer vector representations; (3) synonyms that never appear in each other's contexts end up far apart in the learned vector space. To address these three problems, this paper proposes a distributed word representation learning algorithm based on multi-source information fusion (MSWE), which improves on existing models in four aspects: (1) by explicitly constructing a word-context feature matrix, the co-occurrence information of rare words and their context words is retained in the language model, so that the structural semantic associations between words can be reflected accurately; (2) a property semantic feature matrix is built from the description or explanation texts of words, which effectively compensates for the insufficient training caused by sparse context features; (3) synonym and antonym feature matrices are built from synonym and antonym information, so that synonyms lie close to each other in the embedding space while antonyms lie far apart; (4) the multi-source feature matrices are fused by an inductive matrix completion algorithm to train low-dimensional word representation vectors. Experimental results show that the proposed MSWE algorithm effectively learns informative feature factors from the multi-source word feature matrices and achieves excellent performance on six word similarity evaluation datasets.
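To make the matrix construction concrete, the sketch below is a minimal, self-contained illustration (not the authors' implementation) of two of the feature sources described above: an explicit word-context co-occurrence matrix reweighted into PPMI form, and a synonym/antonym matrix built from a tiny hand-made lexicon. The toy corpus, the lexicon, the mixing weight `alpha`, and the truncated-SVD factorization at the end are all assumptions made purely for illustration; the property-text feature matrix is omitted, and the paper's actual fusion step uses inductive matrix completion rather than a plain SVD.

```python
# Illustrative sketch of multi-source word feature matrices (assumed toy data,
# simplified fusion); not the MSWE implementation from the paper.
import numpy as np
from collections import Counter

corpus = [
    "the quick dog runs fast",
    "the slow cat walks slow",
    "a fast dog chases the slow cat",
]
synonyms = {("quick", "fast")}   # hypothetical lexicon entries
antonyms = {("fast", "slow")}

# 1) Explicit word-context co-occurrence counts within a +/-2 window,
#    so rare words keep their (few) context observations.
window = 2
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
counts = Counter()
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[(w, sent[j])] += 1

# 2) Turn raw counts into a PPMI-weighted word-context feature matrix.
C = np.zeros((len(vocab), len(vocab)))
for (w, c), n in counts.items():
    C[idx[w], idx[c]] = n
row = C.sum(axis=1, keepdims=True)
col = C.sum(axis=0, keepdims=True)
total = C.sum()
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(C * total / (row @ col))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# 3) Synonym/antonym feature matrix: +1 pulls a pair together, -1 pushes it apart.
S = np.zeros_like(ppmi)
for a, b in synonyms:
    S[idx[a], idx[b]] = S[idx[b], idx[a]] = 1.0
for a, b in antonyms:
    S[idx[a], idx[b]] = S[idx[b], idx[a]] = -1.0

# 4) Simplified fusion: factorize a weighted sum of the sources with a truncated SVD
#    (a stand-in for the inductive matrix completion step described in the abstract).
alpha = 0.5
M = ppmi + alpha * S
U, s, _ = np.linalg.svd(M, full_matrices=False)
dim = 3
embeddings = U[:, :dim] * np.sqrt(s[:dim])

for w in ("quick", "fast", "slow"):
    print(w, np.round(embeddings[idx[w]], 3))
```

In the full algorithm each source would be built from large corpora and lexical resources, and the fusion would not be a single SVD: inductive matrix completion, roughly speaking, learns low-rank projection matrices W and H such that an observed entry M_ij is approximated by x_i^T W H^T y_j, where x_i and y_j are the multi-source feature vectors of words i and j.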

Key words

word representation learning / word representation / word embedding / word vector / word feature learning

Cite This Article

冶忠林,赵海兴,张科,朱宇. 基于多源信息融合的分布式词表示学习. 中文信息学报. 2019, 33(10): 18-30
YE Zhonglin, ZHAO Haixing, ZHANG Ke, ZHU Yu. Distributed Word Embedding via Multi-Source Information Fusion. Journal of Chinese Information Processing. 2019, 33(10): 18-30


Funding

Supported by the National Natural Science Foundation of China (11661069, 61763041, 61663041); the Program for Changjiang Scholars and Innovative Research Team in University (IRT_15R40); the Fundamental Research Funds for the Central Universities (2017TS045); and the Key Laboratory of Tibetan Information Processing and Machine Translation of Qinghai Province (2013-Z-Y17).