Abstract
Post-processing is a key component of a Chinese character recognition system. Conventional post-processing methods rely to a large extent on a pre-trained statistical language model; they neither account for the particular characteristics of the text being processed nor exploit the dynamic recognition behavior of the recognizer. This paper presents an adaptive post-processing method that uses the part of the text that has already been proofread and corrected. This text serves two purposes: it is used to build an adaptive language model that promptly captures the linguistic characteristics of the text being processed, and it reveals the dynamic behavior of the recognizer, which is used to modify the candidate character set. Post-processing of the subsequent text thereby becomes adaptive. Experiments on about 400,000 Chinese characters show that the proposed method reduces the average text error rate by 35.24% compared with the conventional post-processing method. It can substantially reduce the workload of data-entry staff and is of considerable practical value.
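The two adaptive components named in the abstract can be illustrated with a small sketch. It is not the authors' implementation, which the abstract does not specify. It assumes a unigram character model interpolated with counts from the proofread text, and a confusion table that appends previously observed corrections to the candidate set. The class name `AdaptivePostProcessor`, the `cache_weight` parameter, and the toy probabilities are all hypothetical.

```python
from collections import Counter, defaultdict


class AdaptivePostProcessor:
    """Toy sketch: cache-adapted unigram model plus a learned confusion table."""

    def __init__(self, static_prob, cache_weight=0.3):
        self.static_prob = static_prob          # hypothetical pre-trained P(char)
        self.cache_weight = cache_weight        # hypothetical interpolation weight
        self.cache = Counter()                  # character counts from proofread text
        self.confusions = defaultdict(Counter)  # recognized char -> true chars observed

    def learn(self, recognized, corrected):
        """Update both adaptive components from one aligned, proofread passage."""
        for rec, truth in zip(recognized, corrected):
            self.cache[truth] += 1
            if rec != truth:
                self.confusions[rec][truth] += 1

    def prob(self, char):
        """Interpolate the static model with the estimate from proofread text."""
        total = sum(self.cache.values())
        cache_p = self.cache[char] / total if total else 0.0
        lam = self.cache_weight if total else 0.0  # no adaptation before any proofreading
        return (1 - lam) * self.static_prob.get(char, 1e-6) + lam * cache_p

    def adjust_candidates(self, candidates):
        """Append characters this recognizer has confused with the top candidate."""
        seen = self.confusions.get(candidates[0], Counter())
        extra = [c for c, _ in seen.most_common() if c not in candidates]
        return list(candidates) + extra

    def pick(self, candidates):
        """Choose the most probable character from the adjusted candidate set."""
        return max(self.adjust_candidates(candidates), key=self.prob)


pp = AdaptivePostProcessor({"己": 0.02, "已": 0.01})
before = pp.pick(["己"])        # recognizer offers only the wrong character
pp.learn("己己己", "已已已")      # proofreader corrected three such errors
after = pp.pick(["己"])         # the correction is now a candidate and wins
```

In this toy run the correct character is absent from the recognizer's candidate list, so no language model alone could recover it. Only after the proofread passage is learned does the adjusted candidate set contain it. A real system would use a higher-order language model and recognizer confidence scores, neither of which is modeled here.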
Key words
Chinese character recognition /
post-processing /
language model /
adaptation /
candidate set modification
Funding
National "863" High-Tech Program (project 863-306-ZT03-03-1); National Natural Science Foundation of China (project 69972024)