Research on Han Character Internal Code Recognition Algorithms in a Multilingual Environment

LI Pei-feng, ZHU Qiao-ming, QIAN Pei-de

Journal of Chinese Information Processing, 2004, Vol. 18, Issue 2: 74-80.


Research of Han Character Internal Codes Recognition Algorithm in the Multi-lingual Environment

  • LI Pei-feng, ZHU Qiao-ming, QIAN Pei-de

Abstract

Migrating Han character internal codes to ISO/IEC 10646 is the inevitable path toward a unified character encoding for computers, but multiple Han character internal codes will continue to coexist for some time. Automatic recognition of the internal code in use is therefore key to supporting that coexistence. This paper examines how to recognize Han character internal codes automatically in a multilingual environment where several internal codes coexist, and presents four recognition algorithms, based respectively on internal code distribution, punctuation features, character frequency features, and semantic features. It then analyzes and evaluates these algorithms. In tests on the target samples, the best recognition rate exceeds 99.9%.

Key words

computer application / Chinese information processing / multilingual environment / Han character internal code / recognition algorithm

Cite this article

LI Pei-feng, ZHU Qiao-ming, QIAN Pei-de. Research of Han Character Internal Codes Recognition Algorithm in the Multi-lingual Environment. Journal of Chinese Information Processing. 2004, 18(2): 74-80

Funding

Supported by the Natural Science Foundation for Colleges and Universities of Jiangsu Province (01kjb520001)