CHE Chao, TENG Hongfei. Combining Pseudo-Samples and Manually-Tagged Samples for Word Sense Disambiguation[J]. Journal of Chinese Information Processing, 2009, 23(6): 31-39.
Combining Pseudo-Samples and Manually-Tagged Samples for Word Sense Disambiguation
CHE Chao1, TENG Hongfei1,2
1. Department of Computer Science & Engineering, Dalian University of Technology, Dalian, Liaoning 116024, China; 2. School of Mechanical Engineering, Dalian University of Technology, Dalian, Liaoning 116024, China
Abstract: Corpus-based methods for word sense disambiguation (WSD) suffer from the “knowledge acquisition bottleneck” problem. Automatic lexical sample acquisition based on equivalent pseudo-words (EPs) is an effective way to address this problem. However, some pseudo-samples collected through EPs are of low quality, and EPs cannot be acquired when the ambiguous word has few monosemous synonyms. This paper proposes a WSD method that combines pseudo-samples with manually tagged samples. The method calculates sentence similarity against the context of the ambiguous word to remove low-quality pseudo-samples. In addition, it uses the manually tagged corpus to estimate the sense distribution probability and to provide samples for ambiguous words that have few monosemous synonyms. In WSD experiments on the Senseval-3 Chinese lexical sample task, the method achieves an average F-measure of 0.79.
Key words: computer application; Chinese information processing; word sense disambiguation; HowNet; equivalent pseudo-words; Bayesian classifier; automatic sample acquisition
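The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words cosine similarity stands in for the paper's sentence similarity measure, the threshold value is arbitrary, comparing each pseudo-sample against manually tagged contexts of the same sense is an assumed reading of "the context of the ambiguous word", and all function names and the toy English data are hypothetical. The sketch filters pseudo-samples by similarity, estimates sense priors from the manually tagged samples only, and trains a naive Bayes classifier with add-one smoothing on the combined samples.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity (an assumed stand-in for the paper's measure)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def filter_pseudo_samples(pseudo, manual, threshold):
    """Keep a pseudo-sample only if it resembles a manual sample of the same sense."""
    kept = []
    for sense, tokens in pseudo:
        refs = [m for s, m in manual if s == sense]
        if any(cosine_similarity(tokens, r) >= threshold for r in refs):
            kept.append((sense, tokens))
    return kept


def train(pseudo, manual, threshold=0.2):
    """Train naive Bayes on filtered pseudo-samples plus manual samples."""
    samples = filter_pseudo_samples(pseudo, manual, threshold) + manual
    # Sense priors come from the manually tagged samples only.
    sense_counts = Counter(s for s, _ in manual)
    total = sum(sense_counts.values())
    priors = {s: c / total for s, c in sense_counts.items()}
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, tokens in samples:
        word_counts[sense].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab


def classify(context, model):
    """Return the sense with the highest log-posterior (add-one smoothing)."""
    priors, word_counts, vocab = model
    best_sense, best_score = None, -math.inf
    for sense, prior in priors.items():
        n = sum(word_counts[sense].values())
        score = math.log(prior)
        for t in context:
            score += math.log((word_counts[sense][t] + 1) / (n + len(vocab)))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Toy data (hypothetical): samples are (sense, context tokens) pairs.
manual = [("finance", ["deposit", "money", "bank"]),
          ("river", ["water", "bank", "shore"])]
pseudo = [("finance", ["money", "loan", "deposit"]),
          ("finance", ["football", "match"]),  # unrelated context, filtered out
          ("river", ["water", "shore", "boat"])]

model = train(pseudo, manual, threshold=0.2)
predicted = classify(["loan", "money"], model)
```

Taking the priors from the manual samples alone reflects the abstract's statement that the manually tagged corpus supplies the sense distribution; senses with no surviving pseudo-samples are still covered because the manual samples are always part of the training set.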