Abstract
This paper proposes a new concept, the degree of cohering, which more accurately reflects how closely two Chinese character strings are related, and gives a method for computing it. The method evaluates the strings within their wider environment, adding context information to improve segmentation accuracy. When it consults word frequencies, it also takes dynamic (local-use) word frequency into account, which allows it to recognize unregistered technical terms automatically. Examples of applying the degree of cohering to word segmentation are presented. Compared with the mutual-information method proposed by earlier researchers, this method can solve some difficult problems in word segmentation and improves segmentation precision.
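The abstract does not state the formula for the degree of cohering, so it is not reproduced here. For orientation only, the following is a minimal sketch of the mutual-information baseline that the abstract compares against, written as standard pointwise mutual information between two adjacent strings. The function name `pointwise_mutual_information`, the toy corpus, and the choice to normalize every count by the corpus length are assumptions made for illustration, not details taken from the paper.

```python
import math


def count_occurrences(corpus: str, s: str) -> int:
    """Count (possibly overlapping) occurrences of s in corpus."""
    return sum(1 for i in range(len(corpus) - len(s) + 1)
               if corpus.startswith(s, i))


def pointwise_mutual_information(corpus: str, left: str, right: str) -> float:
    """PMI of `left` immediately followed by `right`, estimated from `corpus`.

    Assumption: all probabilities are estimated as count / len(corpus),
    a simplification chosen for this sketch.
    """
    total = len(corpus)
    n_left = count_occurrences(corpus, left)
    n_right = count_occurrences(corpus, right)
    n_joint = count_occurrences(corpus, left + right)
    if n_joint == 0:
        # The two strings never appear side by side in this corpus.
        return float("-inf")
    p_left = n_left / total
    p_right = n_right / total
    p_joint = n_joint / total
    return math.log2(p_joint / (p_left * p_right))


# Hypothetical toy corpus, used only to exercise the function.
corpus = "中国人和外国人在中国"
score = pointwise_mutual_information(corpus, "中", "国")
never_adjacent = pointwise_mutual_information(corpus, "和", "中")
```

A score well above zero suggests the two strings co-occur more often than chance and are candidates for joining into one word. Because this baseline uses only corpus-wide counts, it ignores the surrounding context and local-use frequency that the abstract says the degree of cohering adds.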
Key words
computer application /
Chinese information processing /
degree of cohering /
mutual information /
word segmentation
Funding
Supported by the Natural Science Foundation of Hunan Province (02JJY2092)