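A Hybrid Model of Combining Rule-based and Statistics-based Approaches for Automatic Detecting Errors in Chinese Text

ZHANG Yang-sen, CAO Yuan-da, YU Shi-wen

Journal of Chinese Information Processing, 2006, Vol. 20, Issue 4: 3-9, 57.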


Abstract

Automatic proofreading of Chinese text is a challenging research topic in natural language processing. This paper proposes a model and algorithm for automatic error detection in Chinese text that combines rules with statistics. Based on the distribution of single-character words in correct text after word segmentation, and on the concept of the "non-multi-character word error", a set of error-detection rules is proposed. These rules are combined with character bigram and trigram statistical models and part-of-speech bigram and trigram statistical models, all built over the scattered strings of single characters left after segmentation, to form the error-detection model and its implementation algorithm. In an experiment on 30 texts containing 578 error test points, the algorithm achieved a recall of 86.85% and a precision of 69.43%, with a false-alarm rate of 30.57%.
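The statistical half of the idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the text has already been segmented by some external segmenter, uses only a character bigram model, and treats a bigram count below `min_count` as suspicious. The function names, the threshold, and the toy corpus are all hypothetical, and the rule set, character trigram model, and part-of-speech models described in the abstract are omitted.

```python
from collections import Counter

def train_char_bigrams(corpus):
    """Count adjacent character pairs over a list of (assumed correct) sentences."""
    bigrams = Counter()
    for sentence in corpus:
        bigrams.update(zip(sentence, sentence[1:]))
    return bigrams

def single_char_runs(tokens, min_len=2):
    """Yield (start_index, run) for each maximal run of one-character tokens."""
    run, start = [], 0
    for i, tok in enumerate(list(tokens) + [""]):  # empty sentinel flushes the last run
        if len(tok) == 1:
            if not run:
                start = i
            run.append(tok)
        else:
            if len(run) >= min_len:
                yield start, run
            run = []

def flag_suspects(tokens, bigrams, min_count=1):
    """Flag adjacent single characters whose bigram count is below min_count.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    suspects = []
    for start, run in single_char_runs(tokens):
        for offset, (a, b) in enumerate(zip(run, run[1:])):
            if bigrams[(a, b)] < min_count:
                suspects.append((start + offset, a + b))
    return suspects

# Toy training corpus and hand-segmented input (both hypothetical).
bigrams = train_char_bigrams(["我们在学校学习", "他在家学习"])
# A typo turns 学习 into 学席, so the segmenter leaves two stray single characters.
print(flag_suspects(["我们", "在", "学校", "学", "席"], bigrams))  # [(3, '学席')]
```

Only runs of at least two single-character tokens are examined, which reflects the observation in the abstract that errors tend to surface as scattered single characters after segmentation. An isolated single-character word such as 在 is left alone in this sketch.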

Key words

Computer application / Chinese information processing / automatic error detection in Chinese text / combining rule-based and statistics-based approaches / non-multi-character word error / real multi-character word error

Cite this article

ZHANG Yang-sen, CAO Yuan-da, YU Shi-wen. A Hybrid Model of Combining Rule-based and Statistics-based Approaches for Automatic Detecting Errors in Chinese Text. Journal of Chinese Information Processing. 2006, 20(4): 3-9, 57


Funding

Supported by the National 973 Program (2004CB318102); the National 863 Program (2001AA114210, 2002AA117010); and the China Postdoctoral Science Foundation (2005038026).