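To extend segmentation dictionaries and improve the accuracy of word segmentation, this paper proposes an information-entropy-based algorithm for extracting high-frequency Chinese words; its output can be used to identify out-of-vocabulary words and to expand existing dictionaries. We first preprocess the text, converting noise characters and non-Chinese characters into separators, so that the text can be treated as a set of Chinese strings delimited by separators. We then collect frequency statistics for all substrings of these Chinese strings. Finally, we use these statistics to compute the information entropy of each substring and decide whether it is a word. Experiments show that the algorithm is simple to implement and extracts high-frequency words from text fairly effectively, with an acceptance rate of up to 91.68%.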
Abstract
To extend the dictionary used for word segmentation and thereby improve segmentation accuracy, this paper presents a high-frequency Chinese word extraction algorithm based on information entropy. We first convert noise characters and non-Chinese characters into separators, so that a text can be viewed as a collection of Chinese strings delimited by separators. We then compute the frequencies of all substrings of these Chinese strings. Finally, we decide whether each substring is a word by computing its information entropy. Preliminary experiments show that this simple algorithm is effective at extracting high-frequency Chinese words, with an acceptance rate of up to 91.68%.
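The three steps described above (separator-based preprocessing, substring frequency counting, and an entropy-based decision) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which entropy is computed or which thresholds are used, so the sketch assumes the commonly used left/right neighbor entropy, and every name, parameter, and default value in it (`NOISE_CHARS`, `split_chunks`, `count_substrings`, `extract_words`, `min_freq`, `min_entropy`, `max_len`) is hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical noise characters; the abstract does not list the ones it uses.
NOISE_CHARS = "的了"


def split_chunks(text, noise_chars=NOISE_CHARS):
    """Turn noise and non-Chinese characters into separators, return the chunks."""
    for ch in noise_chars:
        text = text.replace(ch, " ")
    # Assumption: "Chinese character" means the CJK Unified Ideographs block.
    return re.findall(r"[\u4e00-\u9fff]+", text)


def count_substrings(chunks, max_len=4):
    """Count every substring (up to max_len) and its left/right neighbors."""
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for k, chunk in enumerate(chunks):
        n = len(chunk)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = chunk[i:j]
                freq[s] += 1
                # Assumption: each chunk boundary counts as its own distinct
                # neighbor, so a separator behaves like a varied context.
                left[s][chunk[i - 1] if i > 0 else ("sep", k)] += 1
                right[s][chunk[j] if j < n else ("sep", k)] += 1
    return freq, left, right


def entropy(counter):
    """Shannon entropy (in bits) of a neighbor distribution."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def extract_words(text, min_freq=2, min_entropy=1.0, max_len=4):
    """Keep frequent multi-character substrings whose contexts vary on both sides."""
    freq, left, right = count_substrings(split_chunks(text), max_len)
    words = {}
    for s, f in freq.items():
        if len(s) < 2 or f < min_freq:
            continue
        h = min(entropy(left[s]), entropy(right[s]))
        if h >= min_entropy:
            words[s] = (f, h)
    return words


if __name__ == "__main__":
    sample = "我爱北京,他去北京。北京很大,你在北京吗"
    for word, (f, h) in extract_words(sample).items():
        print(word, f, h)
```

Taking the minimum of the left and right entropies is one possible design choice: a substring is accepted only if its neighbors vary on both sides, which tends to reject fragments that are always embedded in a longer word. The thresholds would need tuning on a real corpus.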
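Keywords
artificial intelligence /
natural language processing /
word segmentation /
Chinese word extraction /
information entropy /
high-frequency words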
Key words
artificial intelligence /
natural language processing /
Chinese word segmentation /
Chinese word extraction /
information entropy /
high-frequency Chinese words