Abstract
Automatic word segmentation is a prerequisite for Chinese information processing, and the dictionary is the foundation of automatic segmentation: the quality of the dictionary mechanism directly affects the speed and efficiency of segmentation. This paper first analyzes the importance of the dictionary mechanism in Chinese word segmentation and reviews three typical existing mechanisms (binary-seek-by-word, TRIE indexing tree, and binary-seek-by-characters). Building on this analysis, and on the observation that two-character words are especially common in Chinese, it proposes a new dictionary mechanism named double-character-hash-indexing (DCHI). Compared with the three existing mechanisms, DCHI improves the speed and efficiency of segmentation without increasing their space and time complexity or their maintenance difficulty.
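The abstract names the mechanism but does not spell out its layout, so the following is only a minimal sketch of what a double-character hash dictionary might look like. It assumes that the first two characters of each word are hashed to a bucket, that the remaining characters are kept in a sorted list searched by bisection, and that segmentation uses forward maximum matching. The names `DoubleCharHashDict` and `segment` are hypothetical and are not taken from the paper.

```python
from bisect import bisect_left
from collections import defaultdict


class DoubleCharHashDict:
    """Toy dictionary hashed on the first two characters of each word (assumed layout)."""

    def __init__(self, words=()):
        # Two-character prefix -> sorted list of remaining suffixes.
        # An empty suffix "" means the prefix itself is a word.
        self._index = defaultdict(list)
        self._single = set()  # one-character words are stored separately
        for word in words:
            self.add(word)

    def add(self, word):
        if not word:
            return
        if len(word) == 1:
            self._single.add(word)
            return
        suffixes = self._index[word[:2]]
        suffix = word[2:]
        pos = bisect_left(suffixes, suffix)
        if pos == len(suffixes) or suffixes[pos] != suffix:
            suffixes.insert(pos, suffix)  # keep the list sorted, no duplicates

    def __contains__(self, word):
        if len(word) < 2:
            return word in self._single
        suffixes = self._index.get(word[:2])
        if not suffixes:
            return False
        suffix = word[2:]
        pos = bisect_left(suffixes, suffix)
        return pos < len(suffixes) and suffixes[pos] == suffix

    def longest_match(self, text, start=0):
        """Longest dictionary word starting at `start`, else a single character."""
        best = text[start:start + 1]
        prefix = text[start:start + 2]
        suffixes = self._index.get(prefix)
        if len(prefix) < 2 or not suffixes:
            return best
        for suffix in suffixes:
            # One hash lookup has already narrowed the candidates to this bucket.
            if text.startswith(suffix, start + 2) and 2 + len(suffix) > len(best):
                best = prefix + suffix
        return best


def segment(text, dictionary):
    """Forward maximum matching over the sketch dictionary."""
    words, i = [], 0
    while i < len(text):
        word = dictionary.longest_match(text, i)
        words.append(word)
        i += len(word)
    return words


d = DoubleCharHashDict(["中文", "分词", "中文分词", "词典", "机制"])
print(segment("中文分词词典机制", d))  # ['中文分词', '词典', '机制']
```

In this sketch a two-character word is resolved by a single hash lookup, and only longer words need a search inside the bucket. That is one plausible reading of why the abundance of two-character words would favor such a design.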
Key words
computer application /
Chinese information processing /
Chinese word segmentation /
dictionary mechanism /
double character hash indexing
Funding
Supported by a special fund project of the Ministry of Education (2001BA101A12-02) and by the 973 Program (2002CB312006).