A Chinese Spelling Check System for the OCR Output

LI Rong

Journal of Chinese Information Processing, 2009, Vol. 23, Issue 5: 92-98.
Review


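Abstract (translated from the Chinese)

This paper describes a spelling correction system for Chinese text produced by OCR. A large positive/negative corpus is used to build an error-pattern database: the negative corpus contains OCR recognition errors, and the positive corpus contains the same text after those errors have been edited and corrected. First, a sentence-matching algorithm extracts matching sentences from the positive and negative corpora. Next, a comparison algorithm extracts the characters that differ between the two matched sentences. If the sentences differ, an error-word extraction algorithm obtains the error word and its corresponding correction word, which are stored as a triple of the form (correction word, error word, occurrence count). Running these algorithms over the entire positive/negative corpus yields a set of error patterns, from which the error-pattern database is built. An error pattern can be regarded as a correction rule, used to correct errors in a text that have the same form as the pattern's "error word". Error patterns fall into two classes according to the length of the error word: when the error word is longer than two characters, the pattern rule can be applied directly; when the error word is exactly two characters long, a verification algorithm must decide whether the current match should be corrected. This method is the core algorithm of the THOCR Chinese proofreading system developed for the Tongfang Optical Disc company (同方光盘公司), and the positive and negative corpora come from material the company accumulated while building its journal network (期刊网). Because every error pattern obtained by the algorithm comes from real OCR-recognized text, the proofreading results are good. Experimental results for the system are given at the end of the paper.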
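The pattern-building steps described in the abstract can be illustrated with a minimal Python sketch. It is not the THOCR implementation: it assumes the sentence pairs are already matched, uses the standard-library `difflib.SequenceMatcher` as a stand-in for the paper's comparison algorithm, and replaces the unspecified error-word extraction step with a fixed window of `context` characters around each differing span. The function names `extract_patterns` and `build_pattern_db` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Iterable, Iterator, List, Tuple


def extract_patterns(correct: str, recognized: str,
                     context: int = 1) -> Iterator[Tuple[str, str]]:
    """Yield (correction_word, error_word) for each span where the sentences differ."""
    # Assumption: difflib stands in for the paper's comparison algorithm.
    matcher = SequenceMatcher(None, recognized, correct, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Assumption: widen the differing span by `context` characters on each
        # side so that a single wrong character becomes a multi-character
        # "error word". The paper's actual extraction rule is not given here.
        error = recognized[max(i1 - context, 0):i2 + context]
        correction = correct[max(j1 - context, 0):j2 + context]
        yield correction, error


def build_pattern_db(pairs: Iterable[Tuple[str, str]],
                     context: int = 1) -> List[Tuple[str, str, int]]:
    """Return (correction_word, error_word, count) triples, most frequent first."""
    counts: Counter = Counter()
    for correct, recognized in pairs:
        # Identical sentences produce no differing spans and add nothing.
        counts.update(extract_patterns(correct, recognized, context))
    return [(c, e, n) for (c, e), n in counts.most_common()]


if __name__ == "__main__":
    # Each pair is (positive-corpus sentence, negative-corpus sentence).
    sample_pairs = [
        ("我们使用大规模语料库", "我们使用大规摸语料库"),
        ("大规模语料", "大规摸语料"),
        ("没有错误", "没有错误"),
    ]
    print(build_pattern_db(sample_pairs))
```

Counting identical (correction word, error word) pairs across the corpus gives the occurrence count in each triple; a real system would likely derive error-word boundaries from word segmentation rather than a fixed window.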

Abstract

This paper describes a spelling check system for OCR output of Chinese text. A large training corpus is used to build an error-pattern database. First, each correct sentence is matched with its error-containing counterpart, and the characters that differ between them are extracted. Then an error-word extraction algorithm produces error patterns in the form (correction-word, error-word, count). Every error pattern in the resulting database can be regarded as a rule for correcting errors. The patterns are applied according to the length of the error word: directly if it is longer than two characters; otherwise a verification algorithm is applied first. This method lies at the core of the THOCR spelling check system, and experimental results are provided.
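The length-based application rule can be sketched as follows. The abstract does not say how the verification algorithm works, so the check below is purely a placeholder assumption: a two-character pattern is applied only if it was seen at least `min_count` times and its error word is not itself a valid word in a lexicon. The names `correct_text`, `lexicon`, and `min_count` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Pattern = Tuple[str, str, int]  # (correction_word, error_word, count)


def correct_text(text: str, patterns: Iterable[Pattern],
                 lexicon: Set[str], min_count: int = 2) -> str:
    """Apply error patterns to `text` using the two-class rule from the abstract."""
    # Apply longer error words first so they are not broken up by shorter ones.
    for correction, error, count in sorted(patterns, key=lambda p: -len(p[1])):
        if len(error) > 2:
            # Class 1: error word longer than two characters, applied directly.
            text = text.replace(error, correction)
        elif len(error) == 2:
            # Class 2: two-character error word, applied only after verification.
            # Assumption: this frequency-plus-lexicon test is a placeholder for
            # the paper's verification algorithm, which is not described here.
            if count >= min_count and error not in lexicon:
                text = text.replace(error, correction)
        # Shorter error words are skipped; the abstract names only two classes.
    return text


if __name__ == "__main__":
    sample_patterns = [
        ("规模语", "规摸语", 2),
        ("未来", "末来", 5),
        ("大夫", "丈夫", 4),  # "丈夫" is a valid word, so it is left alone
    ]
    sample_lexicon = {"丈夫", "大夫", "未来"}
    print(correct_text("丈夫说末来需要大规摸语料库", sample_patterns, sample_lexicon))
```

The extra check for two-character patterns reflects the concern that short error words are more likely to coincide with legitimate words, so replacing them blindly could introduce new errors.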

Key words

computer application / Chinese information processing / spelling check / training corpus (positive and negative corpora) / learning algorithm

Cite this article

LI Rong. A Chinese Spelling Check System for the OCR Output. Journal of Chinese Information Processing. 2009, 23(5): 92-98

