Abstract
This paper introduces a knowledge-based, unsupervised method for acronym term disambiguation that uses word embeddings to represent the multiple senses of an ambiguous term. In the first stage, significantly similar documents are clustered with high confidence; each cluster corresponds to one interpretation of an acronym term and therefore serves as a semantic tag, and the tagged document collection is used as training data. Word embedding models are then trained several times on this data, and the semantic relation between two words is computed as the cosine similarity of their vectors averaged over the models. In the second stage, feature word expansion and linearly weighted semantic similarity are proposed to improve disambiguation. By computing semantic similarities between documents and interpretations, implicit semantics missing from the clustered documents are mined as new feature words, and each feature word is weighted linearly by its semantic similarity to the specific interpretation. Experiments on 25 ambiguous acronym terms show that feature word expansion improves the system F score by about 4%, and linear semantic weighting adds a further 2%, yielding a final F score of 89.40%.
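The three ingredients described above (averaged cosine similarity over repeatedly trained embeddings, feature word expansion, and linear semantic weighting) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`mean_similarity`, `expand_features`, `weight_features`), the top-k/threshold expansion rule, and the toy dict-based "models" are all assumptions; in practice each model would be a real word embedding model trained on the tagged clusters.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_similarity(w1, w2, models):
    """Semantic relation of a word pair: cosine similarity averaged over
    several independently trained embedding models (here, toy dicts
    mapping word -> vector)."""
    return float(np.mean([cosine(m[w1], m[w2]) for m in models]))

def expand_features(doc_words, vocab, models, k=1, threshold=0.4):
    """Feature word expansion (hypothetical rule): for each document
    feature word, add its top-k most similar vocabulary words that
    exceed a similarity threshold, mining implicit semantics."""
    expanded = set(doc_words)
    for w in list(doc_words):
        ranked = sorted(((mean_similarity(w, v, models), v)
                         for v in vocab if v not in expanded),
                        reverse=True)
        expanded.update(v for s, v in ranked[:k] if s >= threshold)
    return expanded

def weight_features(features, interp_words, models):
    """Linear semantic weighting: each feature word's weight is its
    average similarity to the interpretation's representative words."""
    return {f: float(np.mean([mean_similarity(f, w, models)
                              for w in interp_words]))
            for f in features}
```

With two toy "models" in which `a` and `b` agree in one model and disagree in the other, `mean_similarity("a", "b", models)` returns 0.5, so averaging across repeated trainings smooths out the variance of any single run.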
Key words
acronym term /
term disambiguation /
word embedding /
semantic similarity
Funding
National Natural Science Foundation of China (61300081, 61170162); National Key Technology R&D Program of China (2012BAH16F00); Fundamental Research Funds for the Central Universities, Beijing Language and Culture University (14YJ030005)