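To meet the needs of information retrieval for ancient Chinese books, this paper presents a statistical analysis of ancient Chinese on a large-scale corpus. It first gives a method for dynamically merging knowledge from multiple specialized corpora in information processing. On this basis, a statistical analysis of a 35-million-character corpus of ancient Chinese books shows that, in ancient Chinese, character usage is concentrated on high-frequency characters and quite dispersed over low-frequency ones, with an overall exponentially decreasing trend; bigrams are analyzed as well. The results are then compared with single-character and two-character statistics for modern Chinese, conclusions are drawn, and the characters of ancient Chinese are classified by usage frequency. Finally, the knowledge obtained through this statistical learning has been applied in practice in an information retrieval system for ancient Chinese books.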
Abstract
Motivated by the needs of information retrieval for ancient Chinese books, we carry out a statistical analysis of ancient Chinese on a large-scale corpus. We first propose a method for merging knowledge from corpora in different fields. Building on this method, we analyze the statistics of ancient Chinese over more than 35,000,000 characters. The analysis shows that usage is concentrated on a small set of commonly used characters and widely dispersed over the remaining ones, with frequency decreasing roughly exponentially overall. We then analyze character bigrams and compare ancient Chinese with modern Chinese at both the single-character and two-character level. Based on usage frequency, the characters of ancient Chinese are divided into four classes. Finally, these statistics are applied in an information retrieval system for ancient Chinese books.
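The kind of statistics described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the restriction to the basic CJK block (U+4E00 to U+9FFF), the use of simple count addition to merge corpora, and the coverage thresholds used to split characters into four bands are all hypothetical choices made for this example.

```python
from collections import Counter


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block (an assumption)."""
    return '\u4e00' <= ch <= '\u9fff'


def char_counts(text):
    """Count single characters, skipping punctuation and whitespace."""
    return Counter(ch for ch in text if is_cjk(ch))


def bigram_counts(text):
    """Count adjacent character pairs; punctuation breaks a pair."""
    counts = Counter()
    prev = None
    for ch in text:
        if is_cjk(ch):
            if prev is not None:
                counts[prev + ch] += 1
            prev = ch
        else:
            prev = None
    return counts


def merge(counters):
    """Merge per-corpus counts by summing them (a stand-in for corpus merging)."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total


def coverage_bands(counts, thresholds=(0.5, 0.9, 0.99)):
    """Assign each character a band 0..3 by the cumulative coverage reached
    before it in frequency order. The thresholds are hypothetical."""
    total = sum(counts.values())
    bands = {}
    cumulative = 0
    for ch, n in counts.most_common():
        bands[ch] = sum(cumulative / total >= t for t in thresholds)
        cumulative += n
    return bands


# Two tiny sample "corpora" (illustrative text only).
corpora = ["學而時習之,不亦說乎", "有朋自遠方來,不亦樂乎"]
merged = merge(char_counts(t) for t in corpora)
bigrams = merge(bigram_counts(t) for t in corpora)
bands = coverage_bands(merged)
```

On a real corpus, sorting `merged` by frequency and plotting the counts against rank would show whether the fall-off looks exponential, as the abstract reports, and `bands` gives one simple way to group characters into four frequency classes.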
Keywords
information retrieval /
ancient book retrieval /
character frequency statistics /
bigram /
Chinese information processing
Key words
information retrieval /
ancient Chinese retrieval /
character statistical analysis /
bigram /
Chinese information processing
Funding
National Key Basic Research Program (973) (G1998030509); Natural Science Foundation project (69836040)