ZHANG Hongtao, LONG Chong, ZHU Xiaoyan, SUN Jun. Post-Processing Approach for Printed Chinese Character Recognition[J]. Journal of Chinese Information Processing, 2009, 23(6): 67-72.
Post-Processing Approach for Printed Chinese Character Recognition
ZHANG Hongtao¹, LONG Chong¹, ZHU Xiaoyan¹, SUN Jun²
1. State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 2. Information Technology Laboratory, Fujitsu R&D Center Co. Ltd., Beijing 100016, China
Abstract: In Chinese OCR post-processing, applying high-order Chinese n-gram language models, such as word-based trigram and four-gram models, remains challenging because of data sparseness and the large memory cost of big models. This paper focuses on the post-processing of printed Chinese character recognition and proposes a byte-based language model. Choosing the byte as the representation unit of the language model markedly reduces the model size and largely overcomes the sparseness problem. Experimental results show that the byte-based language model achieves high accuracy at low time and space costs. For the test set with segmentation errors, recognition accuracy increases from 88.67% to 98.32%, an error reduction of 85.18%. Compared with a system using a traditional word-based trigram model, the new system saves 95% of the time cost and nearly 98% of the memory cost with almost no loss in accuracy.
Key words: computer application; Chinese information processing; Chinese character recognition; OCR; language model; post-processing
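To illustrate the idea of a byte-based language model, the following is a minimal sketch, not the authors' implementation. It assumes GB2312 as the text encoding (the abstract names no encoding), a byte trigram with add-one smoothing, and a simple rescoring of whole candidate strings; the names `byte_ngrams`, `ByteTrigramModel`, and `pick_best` are hypothetical.

```python
import math
from collections import Counter

ENCODING = "gb2312"  # assumption: the abstract does not name an encoding
BYTE_VOCAB = 256     # a byte can take at most 256 values


def byte_ngrams(text, n):
    """Return all byte n-grams of the encoded text."""
    data = text.encode(ENCODING)
    return [data[i:i + n] for i in range(len(data) - n + 1)]


class ByteTrigramModel:
    """Byte trigram model with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.trigrams = Counter()
        self.bigrams = Counter()

    def train(self, corpus):
        for sentence in corpus:
            self.trigrams.update(byte_ngrams(sentence, 3))
            self.bigrams.update(byte_ngrams(sentence, 2))

    def log_prob(self, text):
        """Sum of smoothed log P(b3 | b1 b2) over the byte trigrams."""
        total = 0.0
        for gram in byte_ngrams(text, 3):
            numerator = self.trigrams[gram] + 1
            denominator = self.bigrams[gram[:2]] + BYTE_VOCAB
            total += math.log(numerator / denominator)
        return total


def pick_best(model, candidates):
    """Choose the candidate string the model scores highest."""
    return max(candidates, key=model.log_prob)


model = ByteTrigramModel()
model.train(["汉字识别", "印刷体汉字识别后处理"])
# Two recognizer hypotheses that differ in one visually similar character.
best = pick_best(model, ["汉宇识别", "汉字识别"])
```

Because the unit is a byte, the smoothing vocabulary is only 256 symbols, which is what keeps the count tables small compared with a word-based model. The candidates here have equal byte length; comparing hypotheses of different lengths would need length normalization, which this sketch omits.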