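This paper describes a spelling correction system for Chinese text produced by OCR. A large positive-negative corpus is used to build an error-pattern database: the negative corpus contains text with OCR recognition errors, and the positive corpus contains the same text after editors have corrected those errors. First, a sentence matching algorithm extracts matched sentence pairs from the positive and negative corpora. Next, a comparison algorithm extracts the characters that differ between the two matched sentences. If the two sentences differ, an error-word extraction algorithm obtains the error word and its corresponding correction word and stores them as a triple of the form (correction word, error word, occurrence count). Running these algorithms over the entire positive-negative corpus yields a set of error patterns, from which the error-pattern database is built. Each error pattern can be regarded as a correction rule, used to correct errors in a text that have the same form as the pattern's "error word". Error patterns fall into two classes according to the length of the error word: patterns whose error word is longer than two characters are applied directly as correction rules, whereas patterns whose error word is exactly two characters require a verification algorithm to decide whether the current match should be corrected. This method is the core algorithm of the THOCR Chinese proofreading system developed for Tongfang Optical Disc Co. (同方光盘公司), and the positive and negative corpora come from material the company accumulated while building its journal network (期刊网). Because every error pattern obtained by the algorithm comes from real OCR-recognized text, the proofreading results are fairly good. Experimental results for the system are given at the end of the paper.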
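The pattern-building steps described above can be illustrated with a minimal Python sketch. It is not the THOCR implementation: it assumes the sentence pairs have already been matched, uses the standard-library `difflib.SequenceMatcher` in place of the paper's comparison algorithm, and approximates error-word extraction by widening each differing span by a hypothetical `context` parameter. The function names `extract_patterns` and `build_pattern_db` are likewise illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_patterns(correct, erroneous, context=1):
    """Return (correction_word, error_word) pairs for one matched sentence pair.

    Assumption: each differing span is widened by `context` characters on
    both sides as a stand-in for the paper's error-word extraction step.
    """
    pairs = []
    matcher = SequenceMatcher(None, erroneous, correct, autojunk=False)
    for tag, e1, e2, c1, c2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # this sketch only handles substituted characters
        lo_e, hi_e = max(0, e1 - context), min(len(erroneous), e2 + context)
        lo_c, hi_c = max(0, c1 - context), min(len(correct), c2 + context)
        pairs.append((correct[lo_c:hi_c], erroneous[lo_e:hi_e]))
    return pairs


def build_pattern_db(sentence_pairs, context=1):
    """Count patterns over all (correct, erroneous) pairs and return triples."""
    counts = Counter()
    for correct, erroneous in sentence_pairs:
        counts.update(extract_patterns(correct, erroneous, context))
    # Each entry is (correction word, error word, occurrence count).
    return [(c, e, n) for (c, e), n in counts.items()]


if __name__ == "__main__":
    # Made-up sentence pairs, for illustration only.
    demo = [("我们在学习", "我们在字习"), ("他在学习", "他在字习")]
    print(build_pattern_db(demo))
```

Counting identical (correction word, error word) pairs across the corpus is what produces the third element of each triple; how the real system chooses word boundaries is not specified in the abstract.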
Abstract
This paper describes a spelling check system for OCR output of Chinese text. A large training corpus is used to build an error-pattern database. First, each correct sentence is matched with its counterpart containing errors, and the characters that differ between them are extracted. Then an error-word extraction algorithm is executed to obtain error patterns in the form (correction-word, error-word, count). Every error pattern in the resulting database can be regarded as a rule for correcting errors. The error patterns are applied according to the length of the error-word: a pattern is applied directly if its error-word is longer than two characters; otherwise a verification algorithm is applied first. This method is the core of the THOCR spelling check system, and experimental results are provided.
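The length-based application rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the system's actual code: the abstract does not specify the verification algorithm, so it is represented here by a caller-supplied `verify` callback, and the function name `apply_patterns` and the longest-first ordering are hypothetical choices.

```python
def apply_patterns(text, patterns, verify):
    """Apply (correction_word, error_word, count) triples to `text`.

    Assumption: `verify(text, position, correction, error)` is a hypothetical
    callback standing in for the unspecified verification algorithm.
    """
    # Try longer error words first so that more specific patterns win.
    for correction, error, _count in sorted(patterns, key=lambda p: -len(p[1])):
        if len(error) < 2:
            continue  # the abstract only covers error words of length >= 2
        out = []
        pos = 0
        while True:
            hit = text.find(error, pos)
            if hit == -1:
                break
            direct = len(error) > 2  # longer than two characters: apply directly
            if direct or verify(text, hit, correction, error):
                out.append(text[pos:hit] + correction)
            else:
                out.append(text[pos:hit + len(error)])  # leave the match as-is
            pos = hit + len(error)
        out.append(text[pos:])
        text = "".join(out)
    return text
```

Passing the verifier as a parameter keeps the sketch neutral about how two-character matches are confirmed; a real verifier would presumably inspect the surrounding context before accepting the correction.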
Key words: computer application; Chinese information processing; spelling check; training corpus; learning algorithm