A Chinese Word Extraction Algorithm Based on Information Entropy

REN He, ZENG Jun-fang

Journal of Chinese Information Processing (中文信息学报), 2006, Vol. 20, Issue 5: 42-45, 92.

Abstract

To extend the dictionary used for word segmentation and thereby improve segmentation accuracy, this paper presents a high-frequency Chinese word extraction algorithm based on information entropy; its output can be used to identify out-of-vocabulary words and to extend existing dictionaries. We first preprocess the text, converting noise characters and non-Chinese characters into separators, so that the text can be viewed as a collection of Chinese strings delimited by separators. We then count the frequency information of all substrings of these Chinese strings. Finally, we compute the information entropy of each substring from these counts to judge whether it is a word. Experiments show that the algorithm is simple to implement and extracts high-frequency words from text fairly effectively, with an acceptance rate of up to 91.68%.
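The three steps in the abstract (separator-based preprocessing, substring counting, entropy-based judgment) can be illustrated with a minimal Python sketch. The abstract does not give the entropy formula, the thresholds, or the noise-character list, so everything specific here is an assumption: the sketch uses the entropy of a substring's left and right neighbouring characters, a common choice for this kind of task that may differ from the authors' formulation. The function names and the parameters `max_len`, `min_freq`, `min_entropy`, and `noise_chars` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: "Chinese character" means the CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def split_into_strings(text, noise_chars=""):
    """Step 1: turn noise and non-Chinese characters into separators."""
    for ch in noise_chars:  # the paper's actual noise list is not given here
        text = text.replace(ch, " ")
    return [s for s in NON_CHINESE.split(text) if s]


def count_substrings(strings, max_len=4):
    """Step 2: count each substring (up to max_len) and its neighbours."""
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_len, len(s)) + 1):
                sub = s[i:j]
                freq[sub] += 1
                # "|" stands in for the separator at either end of a string.
                left[sub][s[i - 1] if i > 0 else "|"] += 1
                right[sub][s[j] if j < len(s) else "|"] += 1
    return freq, left, right


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by the counts."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def extract_words(text, min_freq=3, min_entropy=1.0, max_len=4, noise_chars=""):
    """Step 3: keep frequent substrings whose neighbours vary on both sides."""
    strings = split_into_strings(text, noise_chars)
    freq, left, right = count_substrings(strings, max_len)
    words = []
    for sub, n in freq.items():
        if len(sub) < 2 or n < min_freq:
            continue
        # Assumed criterion: the smaller of the two neighbour entropies
        # must reach the threshold for the substring to count as a word.
        if min(entropy(left[sub]), entropy(right[sub])) >= min_entropy:
            words.append(sub)
    return words


sample = "我爱信息熵。他学信息熵吗?你算信息熵吧!"
print(extract_words(sample))  # ['信息熵']
```

In this toy input, "信息" and "息熵" are as frequent as "信息熵" but are rejected because one of their neighbours is always the same character, giving an entropy of zero on that side. The thresholds are illustrative only and would need tuning on a real corpus.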

Key words

artificial intelligence / natural language processing / Chinese word segmentation / Chinese word extraction / information entropy / high-frequency Chinese words

Cite this article

REN He, ZENG Jun-fang. A Chinese Word Extraction Algorithm Based on Information Entropy. Journal of Chinese Information Processing. 2006, 20(5): 42-45, 92.
