Abstract: This paper describes a spelling-check system for the OCR output of Chinese text. A large training corpus is used to build an error-pattern database. First, each correct sentence is matched against its error-containing counterpart, and the characters that differ between them are extracted. An error-word extraction algorithm then derives error patterns of the form (correction-word, error-word, count). Every error pattern in the resulting database can be treated as a rule for correcting errors. The database is applied according to the length of the error-word: a pattern is applied directly if the error-word is longer than two characters; otherwise, a verification algorithm is applied. This method forms the core of the THOCR spelling-check system, and experimental results are provided.
Key words: computer application; Chinese information processing; spelling check; training corpus; learning algorithm
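The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the THOCR implementation: it assumes substitution errors only (aligned sentences of equal length), pads each differing run with a hypothetical `context` parameter in place of the paper's error-word extraction algorithm, and uses a simple count threshold (`min_count`, also hypothetical) as a stand-in for the verification algorithm, whose details the abstract does not give. All function names are illustrative.

```python
from collections import Counter


def diff_runs(correct, ocr):
    """Return (start, end) spans where two equal-length strings differ."""
    if len(correct) != len(ocr):
        # Assumption: only substitution errors, so the strings align 1:1.
        raise ValueError("this sketch handles substitution errors only")
    runs, start = [], None
    for i, (c, o) in enumerate(zip(correct, ocr)):
        if c != o and start is None:
            start = i                      # a differing run begins
        elif c == o and start is not None:
            runs.append((start, i))        # the run ended at i (exclusive)
            start = None
    if start is not None:
        runs.append((start, len(ocr)))     # run reaches the end of the sentence
    return runs


def build_patterns(pairs, context=1):
    """Count (correction_word, error_word) patterns over training pairs.

    `context` is a hypothetical stand-in for error-word extraction: each
    differing run is padded with that many characters on either side.
    """
    counts = Counter()
    for correct, ocr in pairs:
        for start, end in diff_runs(correct, ocr):
            lo = max(0, start - context)
            hi = min(len(ocr), end + context)
            counts[(correct[lo:hi], ocr[lo:hi])] += 1
    return counts


def correct_text(text, patterns, min_count=2):
    """Apply patterns: directly if the error-word is longer than two
    characters, otherwise only if it passes a placeholder verification
    (here: seen at least `min_count` times in training)."""
    # Try longer error-words first so they take precedence over short ones.
    ordered = sorted(patterns.items(), key=lambda kv: -len(kv[0][1]))
    for (good, bad), count in ordered:
        if len(bad) > 2 or count >= min_count:
            text = text.replace(bad, good)
    return text


# Toy training corpus: (correct sentence, OCR output) pairs.
pairs = [("我们已经完成了", "我们己经完成了"), ("他已经走了", "他己经走了")]
patterns = build_patterns(pairs, context=1)
print(correct_text("我们己经到了", patterns))
```

With `context=0` the same corpus yields only the one-character pattern (已, 己), which is exactly the short, ambiguous case that the abstract routes through verification rather than applying blindly.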