多语种文本图像中的文字语种辨识方法的研究

朴明姬,崔荣一

Journal of Chinese Information Processing, 2017, Vol. 31, Issue 2: 220-225.
语音与文字


An Approach to Script Identification in Image with Multi-lingual Texts

  • PIAO Mingji, CUI Rongyi

摘要

本文针对汉字、朝鲜文字和英文单词混合的文本图像提出了基于主成分分析技术以文字为单位进行文种辨识的方法。首先,通过主成分分析方法构造特征空间,并且把分割的文字映射到此空间得到重构图像;其次,计算原图像和重构图像的水平和垂直方向直方图的相对熵;最后,根据原图像和重构图像之间的欧式距离和相对熵来判别文字语种。实验表明,本文提出的方法在没有分割错误的情况下,能获得99.78%的识别准确率,有效地解决了在汉、朝、英三种文字混合构成的文档图像中文种辨识问题。

Abstract

A PCA-based, character-level script identification method is proposed for images that mix Chinese characters, Korean characters, and English words. First, a feature space is constructed by principal component analysis (PCA), and each segmented character is projected into this space to obtain a reconstructed image. Second, the relative entropy between the horizontal and vertical histograms of the original image and those of the reconstructed image is computed. Finally, the script is identified from the Euclidean distance and the relative entropy between the original and reconstructed images. Experiments show that, in the absence of segmentation errors, the proposed method achieves 99.78% identification accuracy, effectively addressing script identification in document images that mix Chinese, Korean, and English text.
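
The two quantities named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of SVD to obtain the principal axes, the smoothing constant `eps`, and the choice to sum the horizontal and vertical relative entropies are all assumptions made for illustration, and the abstract does not state how the two quantities are combined into a final decision.

```python
import numpy as np

def fit_eigenspace(samples, n_components):
    """Build a PCA feature space from flattened character images (n, h*w)."""
    mean = samples.mean(axis=0)
    # Rows of vt are the principal axes of the mean-centered training data.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruct(image, mean, components):
    """Project an image into the eigenspace and map it back to pixel space."""
    flat = image.ravel().astype(float)
    coeffs = components @ (flat - mean)
    return (mean + components.T @ coeffs).reshape(image.shape)

def projection_histograms(image, eps=1e-9):
    """Normalized horizontal (per-row) and vertical (per-column) histograms."""
    img = np.clip(image, 0.0, None)      # reconstructions may dip below zero
    horizontal = img.sum(axis=1) + eps   # eps (assumed) avoids log(0)
    vertical = img.sum(axis=0) + eps
    return horizontal / horizontal.sum(), vertical / vertical.sum()

def relative_entropy(p, q):
    """Relative entropy (Kullback-Leibler divergence) D(p || q)."""
    return float(np.sum(p * np.log(p / q)))

def script_features(image, mean, components):
    """Return (Euclidean distance, summed relative entropy) for one character."""
    recon = reconstruct(image, mean, components)
    distance = float(np.linalg.norm(image.astype(float) - recon))
    h_orig, v_orig = projection_histograms(image)
    h_rec, v_rec = projection_histograms(recon)
    entropy = relative_entropy(h_orig, h_rec) + relative_entropy(v_orig, v_rec)
    return distance, entropy
```

A character that resembles the training script should reconstruct well, giving a small distance and a small relative entropy, while a character from another script should give larger values. One plausible decision rule, again an assumption, is to compare these values across per-script eigenspaces or against thresholds.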

关键词

文种辨识 / 主成分分析 / 相对熵 / 欧式距离 / 文字分割

Key words

script identification / principal component analysis / relative entropy / Euclidean distance / character segmentation

Cite this article

朴明姬,崔荣一. 多语种文本图像中的文字语种辨识方法的研究. 中文信息学报. 2017, 31(2): 220-225
PIAO Mingji, CUI Rongyi. An Approach to Script Identification in Image with Multi-lingual Texts. Journal of Chinese Information Processing. 2017, 31(2): 220-225

Funding

Jilin Province Science and Technology Development Program (20140101186JC); State Language Commission 2015 Research Project (教语信司函〔2015〕21号)