Layout Analysis for Historical Tibetan Documents Based on Convolutional Denoising Autoencoder
ZHANG Xiqun1,2, MA Longlong3, DUAN Lijuan1,4, LIU Zeyu3, WU Jian3
1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China; 2. Beijing Key Laboratory of Trusted Computing, Beijing 100124, China; 3. Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; 4. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100124, China
Abstract: The digitization of historical documents has attracted increasing research interest in recent years. Focusing on layout analysis, an essential step in digitizing historical documents, this paper proposes a convolutional denoising autoencoder approach for historical Tibetan documents. First, the document images are clustered into superpixel blocks. Then, the convolutional autoencoder extracts features from these blocks. Finally, an SVM classifier labels the superpixel blocks, thereby identifying the different parts of the historical Tibetan document. Experiments on a dataset of historical Tibetan documents show that the method can effectively separate the different layout elements of these documents.
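The first stage described in the abstract, clustering a page image into superpixel blocks, can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm is used, so the code assumes a SLIC-style scheme (localized k-means over intensity and position), simplified to grayscale and written in pure Python. The function name `slic_gray`, its parameters `step` and `compactness`, and the toy image are all illustrative assumptions, not the authors' implementation. The autoencoder and SVM stages are not shown.

```python
import math

def slic_gray(img, step, compactness=10.0, n_iter=5):
    """Simplified SLIC-style clustering of a grayscale image (list of rows).

    Illustrative assumption only: returns a label map shaped like img.
    """
    h, w = len(img), len(img[0])
    # Seed cluster centers on a regular grid as [intensity, row, col].
    centers = [
        [float(img[r][c]), float(r), float(c)]
        for r in range(step // 2, h, step)
        for c in range(step // 2, w, step)
    ]
    labels = [[-1] * w for _ in range(h)]
    for _ in range(n_iter):
        dist = [[math.inf] * w for _ in range(h)]
        # Assignment: each center competes only for pixels near it.
        for k, (ci, cr, cc) in enumerate(centers):
            r0, r1 = max(0, int(cr) - step), min(h, int(cr) + step + 1)
            c0, c1 = max(0, int(cc) - step), min(w, int(cc) + step + 1)
            for r in range(r0, r1):
                for c in range(c0, c1):
                    d_color = img[r][c] - ci
                    d_space = math.hypot(r - cr, c - cc)
                    # Combined distance: intensity term plus scaled spatial term.
                    d = math.hypot(d_color, compactness * d_space / step)
                    if d < dist[r][c]:
                        dist[r][c] = d
                        labels[r][c] = k
        # Update: move each center to the mean of its assigned pixels.
        sums = [[0.0, 0.0, 0.0, 0] for _ in centers]
        for r in range(h):
            for c in range(w):
                s = sums[labels[r][c]]
                s[0] += img[r][c]
                s[1] += r
                s[2] += c
                s[3] += 1
        for k, (si, sr, sc, n) in enumerate(sums):
            if n:
                centers[k] = [si / n, sr / n, sc / n]
    return labels

# Toy 8x8 "page": dark region in columns 0-4, bright region in columns 5-7.
img = [[0 if c < 5 else 255 for c in range(8)] for _ in range(8)]
labels = slic_gray(img, step=4)
```

In a full pipeline of the kind the abstract outlines, a patch around each superpixel would then be passed to the trained encoder, and the resulting feature vector to the SVM. In this toy example the superpixels follow the intensity edge rather than the initial grid, which is the property that makes them useful units for layout classification.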