Abstract
With the rapid development of society, new words keep emerging in daily life. Collecting and organizing these new words is an important research topic in Chinese information processing. This paper presents a method for detecting new words automatically. Web pages collected from the Internet are analysed on a large scale to build a very large set of words and character strings, from which new-word candidates are detected automatically. The candidates are then filtered with word-formation rules, and the new words present in the collected corpus are finally extracted. A system built on this method can find new words of any length and in any domain. It is currently being used in the compilation of the Modern Chinese New Word Information (Electronic) Dictionary (《现代汉语新词语信息(电子)词典》), where it has greatly reduced the manual effort of finding new words.
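The pipeline outlined in the abstract (build a set of character strings from a corpus, detect candidates, filter them by rule) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the function names, the frequency threshold `min_freq`, the length cap `max_len`, and the `stop_chars` rule are all hypothetical stand-ins, and the cap on string length is a simplification that the described system does not have.

```python
from collections import Counter


def candidate_strings(texts, max_len=4):
    """Count every character n-gram (length 2..max_len) inside runs of Chinese characters."""
    counts = Counter()
    for text in texts:
        run = []
        # The trailing newline acts as a sentinel so the last run is flushed.
        for ch in text + "\n":
            if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs (basic block)
                run.append(ch)
                continue
            for n in range(2, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts["".join(run[i:i + n])] += 1
            run = []
    return counts


def detect_new_words(texts, lexicon, min_freq=3, max_len=4, stop_chars="的了是在和"):
    """Return (string, frequency) pairs that are frequent, not in the lexicon,
    and pass a toy filtering rule.

    Assumed rule (hypothetical): reject strings that begin or end with a
    common function character listed in stop_chars.
    """
    counts = candidate_strings(texts, max_len)
    result = []
    for s, c in counts.items():
        if c < min_freq:
            continue  # too rare to be a reliable candidate
        if s in lexicon:
            continue  # already a known word
        if s[0] in stop_chars or s[-1] in stop_chars:
            continue  # fails the toy filtering rule
        result.append((s, c))
    # Most frequent first; ties broken by the string itself for stable output.
    return sorted(result, key=lambda item: (-item[1], item[0]))
```

Calling `detect_new_words(pages, known_words)` on a list of page texts and a set of known words returns the surviving candidates sorted by frequency. A real system would replace the single toy rule with proper word-formation rules and would still need manual review of the output.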
Key words
computer application /
Chinese language processing /
new word /
automatic detection