Abstract
Word sense disambiguation (WSD) is a classic problem in natural language processing. In this paper, we incorporate the sememe information from HowNet, which encodes word semantics, into the training of a language model: words are represented as vectors built from sememe embeddings, so that semantic features of words are learned automatically and feature learning becomes more efficient. To disambiguate a polysemous word, its context is used to form a feature vector, and the sense whose vector has the highest cosine similarity to this feature vector is selected. As an unsupervised method, this approach greatly reduces the computational and time cost of WSD. On the SENSEVAL-3 test data it achieves a precision of 37.7%, slightly higher than other unsupervised WSD methods on the same test set.
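The disambiguation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sense vectors, context vectors, and all names here are hypothetical, and we assume each sense vector is built from sememe embeddings while the context feature vector is the average of the context words' embeddings.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(sense_vectors, context_vectors):
    """Select the sense whose vector is most similar to the context feature vector.

    sense_vectors: dict mapping a sense label to its (sememe-based) vector.
    context_vectors: list of embeddings of the context words.
    """
    context = np.mean(context_vectors, axis=0)  # context feature vector
    scores = {sense: cosine(vec, context) for sense, vec in sense_vectors.items()}
    return max(scores, key=scores.get)

# Toy example with made-up 3-d embeddings for two senses of an ambiguous word.
senses = {
    "sense_finance": np.array([0.9, 0.1, 0.0]),
    "sense_river":   np.array([0.0, 0.2, 0.9]),
}
context = [np.array([0.8, 0.2, 0.1]), np.array([0.7, 0.0, 0.2])]
print(disambiguate(senses, context))  # → sense_finance
```

Because the method only compares precomputed vectors, no sense-annotated training data is needed, which is what makes the approach unsupervised.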
Keywords: word embedding; HowNet; WSD; unsupervised methods
Funding
Supported by the National Natural Science Foundation of China (61300081, 61170162) and the Graduate Innovation Fund of Beijing Language and Culture University (the Fundamental Research Funds for the Central Universities) (15YCX100).