Research on the Method of Automatic Correction of Chinese Part-of-Speech Tagging

QIAN Yi-li (钱揖丽), ZHENG Jia-heng (郑家恒)

Journal of Chinese Information Processing ›› 2004, Vol. 18 ›› Issue (2): 31-36.

Abstract

Disambiguating multi-category words (words that can take more than one part of speech) is one of the main difficulties in part-of-speech tagging of Chinese corpora, and it seriously degrades tagging quality. To address this problem, this paper proposes a method for automatically correcting the part-of-speech tags of multi-category words. Using data mining and rough sets, the method extracts useful information from a correctly tagged training corpus, automatically generates part-of-speech correction rules for multi-category words, and applies those rules to automatically correct a corpus that has been initially tagged by machine. In closed and open tests on a corpus of 500,000 Chinese characters, the tagging accuracy for multi-category words after correction increased by 11.32% and 5.97%, respectively.
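
The abstract does not specify the rule format or the rough-set procedure, so the following is only a minimal sketch of the general idea (mine correction rules from a correctly tagged corpus, then apply them to a machine-tagged one). It assumes a rule is keyed on the word plus the tags of its left and right neighbours, and it substitutes simple support and confidence thresholds for the rough-set step. The function names `mine_rules` and `correct`, the thresholds, the toy tag set, and the example sentences are all hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict

BOS, EOS = "<BOS>", "<EOS>"  # hypothetical sentence-boundary markers


def context_of(sentence, i):
    """Return (word, previous tag, next tag) for position i of a tagged sentence."""
    prev_tag = sentence[i - 1][1] if i > 0 else BOS
    next_tag = sentence[i + 1][1] if i + 1 < len(sentence) else EOS
    return sentence[i][0], prev_tag, next_tag


def mine_rules(training_sentences, min_support=2, min_confidence=0.9):
    """Mine (word, prev_tag, next_tag) -> tag rules for multi-category words.

    Assumption: frequency thresholds stand in for the rough-set reduction
    mentioned in the abstract, whose details are not given there.
    """
    # A word is treated as multi-category if it carries more than one tag.
    tags_by_word = defaultdict(set)
    for sent in training_sentences:
        for word, tag in sent:
            tags_by_word[word].add(tag)
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}

    # Count which tag each multi-category word receives in each context.
    counts = defaultdict(Counter)
    for sent in training_sentences:
        for i, (word, tag) in enumerate(sent):
            if word in ambiguous:
                counts[context_of(sent, i)][tag] += 1

    # Keep a rule only when its dominant tag is frequent and reliable enough.
    rules = {}
    for context, tag_counts in counts.items():
        best_tag, best_count = tag_counts.most_common(1)[0]
        total = sum(tag_counts.values())
        if best_count >= min_support and best_count / total >= min_confidence:
            rules[context] = best_tag
    return rules


def correct(sentence, rules):
    """Return a copy of a machine-tagged sentence with matching rules applied."""
    corrected = list(sentence)
    for i, (word, tag) in enumerate(sentence):
        new_tag = rules.get(context_of(sentence, i))
        if new_tag is not None and new_tag != tag:
            corrected[i] = (word, new_tag)
    return corrected


# Toy, hand-made training data: "建议" is tagged both as verb (v) and noun (n).
train = [
    [("他", "r"), ("建议", "v"), ("休息", "v")],
    [("我", "r"), ("建议", "v"), ("休息", "v")],
    [("这个", "r"), ("建议", "n"), ("很", "d"), ("好", "a")],
    [("那个", "r"), ("建议", "n"), ("很", "d"), ("好", "a")],
]
rules = mine_rules(train)

# A machine-tagged sentence in which "建议" was wrongly tagged as a verb.
machine_tagged = [("这个", "r"), ("建议", "v"), ("很", "d"), ("好", "a")]
print(correct(machine_tagged, rules))
```

In this sketch the context tags are read from the initial machine tagging, so an error in a neighbouring tag can prevent a rule from firing; a fuller treatment would need to decide how to handle that case.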

Key words

computer application / Chinese information processing / multi-category word / Chinese part-of-speech tagging / automatic correction / rough sets

Cite this article

QIAN Yi-li, ZHENG Jia-heng. Research on the Method of Automatic Correction of Chinese Part-of-Speech Tagging. Journal of Chinese Information Processing. 2004, 18(2): 31-36

Funding

Supported by the National High Technology Research and Development Program of China (863 Program), grant 2001AA114031