Abstract
Co-occurrence word extraction plays an important role in information mining and natural language processing. Traditional extraction methods rely on a single statistic, so their results are imprecise and require considerable manual post-editing. This paper presents a co-occurrence word extraction algorithm based on the lexical attraction and repulsion model, and improves its results by combining several commonly used statistics. In an open test, the user-interest rate of the extracted co-occurrence words was 60.87%. Applied to a Web-based co-occurrence word retrieval system, the algorithm achieves good speed and good extraction precision.
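The abstract does not spell out which statistics are combined or how the attraction and repulsion model is applied, so the following is only a minimal sketch of the general idea: counting word pairs that co-occur within a bounded distance and reporting more than one statistic per pair. The function name `cooccurrence_scores`, the parameters `max_distance` and `min_count`, and the choice of pointwise mutual information (PMI) plus mean co-occurrence distance are all assumptions made for illustration, not the authors' algorithm.

```python
import math
from collections import Counter


def cooccurrence_scores(sentences, max_distance=5, min_count=2):
    """Score word pairs by PMI and mean co-occurrence distance.

    Illustrative only: `sentences` is a list of already-segmented token
    lists; the statistics used here are assumptions, not the paper's.
    """
    word_counts = Counter()
    pair_counts = Counter()
    distance_sums = Counter()

    for tokens in sentences:
        word_counts.update(tokens)
        for i, left in enumerate(tokens):
            # Look ahead at most `max_distance` positions.
            upper = min(i + 1 + max_distance, len(tokens))
            for j in range(i + 1, upper):
                right = tokens[j]
                if left == right:
                    continue
                pair = tuple(sorted((left, right)))  # order-independent key
                pair_counts[pair] += 1
                distance_sums[pair] += j - i

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for pair, count in pair_counts.items():
        if count < min_count:
            continue  # drop rare pairs, whose PMI is unreliable
        a, b = pair
        p_pair = count / total_pairs
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        pmi = math.log2(p_pair / (p_a * p_b))
        mean_distance = distance_sums[pair] / count
        scores[pair] = (pmi, mean_distance)
    return scores
```

Calling `cooccurrence_scores(segmented_sentences)` returns a dictionary mapping each retained word pair to a `(pmi, mean_distance)` tuple. A ranking step that combines the two values, for example by favouring high PMI at short distances, would be a further design choice that the abstract leaves open.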
Key words
computer application /
Chinese information processing /
co-occurrence words /
lexical attraction and repulsion model /
co-occurrence distance
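Funding
Supported by the Natural Science Foundation of Fujian Province (A0310009) and the Key Science and Technology Project of Fujian Province (2001J005).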