Abstract
Disambiguating multi-category words (words that can take more than one part of speech) is one of the main difficulties in part-of-speech tagging of Chinese text, and it seriously degrades the tagging quality of corpora. To address this problem, this paper proposes a method for automatically correcting the part-of-speech tags of multi-category words. The method uses rough sets and data mining to extract useful information from a correctly tagged training corpus, automatically generates correction rules for multi-category words, and then applies these rules to correct the initial machine-tagged corpus, thereby improving the tagging quality of multi-category words. In closed and open tests on a 500,000-character Chinese corpus, the tagging accuracy for multi-category words after correction increased by 11.32% and 5.97%, respectively.
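The mine-then-correct workflow described in the abstract can be illustrated with a minimal sketch. It is not the paper's rough-set procedure: it substitutes a simple frequency count over the neighbouring tags, and the function names, tag labels, thresholds, and toy corpus are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

def mine_rules(gold, ambiguous, min_support=2, min_confidence=0.8):
    """Mine (word, prev_tag, next_tag) -> tag rules from a correctly tagged corpus.

    Simplified stand-in for the paper's rough-set mining (an assumption).
    """
    counts = defaultdict(Counter)
    for sent in gold:
        for i, (word, tag) in enumerate(sent):
            if word not in ambiguous:
                continue
            prev_tag = sent[i - 1][1] if i > 0 else "<s>"
            next_tag = sent[i + 1][1] if i + 1 < len(sent) else "</s>"
            counts[(word, prev_tag, next_tag)][tag] += 1
    rules = {}
    for context, tag_counts in counts.items():
        best_tag, best = tag_counts.most_common(1)[0]
        total = sum(tag_counts.values())
        # Keep a rule only if it is frequent enough and nearly unambiguous.
        if best >= min_support and best / total >= min_confidence:
            rules[context] = best_tag
    return rules

def correct(sent, rules):
    """Apply mined rules to one machine-tagged sentence.

    The context is read from the initial (machine) tags of the neighbours.
    """
    fixed = []
    for i, (word, tag) in enumerate(sent):
        prev_tag = sent[i - 1][1] if i > 0 else "<s>"
        next_tag = sent[i + 1][1] if i + 1 < len(sent) else "</s>"
        fixed.append((word, rules.get((word, prev_tag, next_tag), tag)))
    return fixed

# Toy, hand-made data (hypothetical): "建议" can be a verb (v) or a noun (n).
gold = [
    [("他", "r"), ("建议", "v"), ("我们", "r"), ("休息", "v")],
    [("我", "r"), ("建议", "v"), ("你", "r"), ("去", "v")],
    [("这个", "r"), ("建议", "n"), ("很", "d"), ("好", "a")],
    [("那个", "r"), ("建议", "n"), ("很", "d"), ("对", "a")],
]
rules = mine_rules(gold, ambiguous={"建议"})

# A machine-tagged sentence in which "建议" was wrongly tagged as a noun.
machine = [("她", "r"), ("建议", "n"), ("大家", "r"), ("休息", "v")]
corrected = correct(machine, rules)
```

In this sketch the support and confidence thresholds play the role of rule filtering; a rough-set approach would instead derive rules from reduced attribute sets of the context.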
Key words
computer application /
Chinese information processing /
multi-category word /
Chinese part-of-speech tagging /
automatic correction /
rough sets
Funding
Supported by the National High Technology Research and Development Program of China (863 Program) (2001AA114031)