Abstract
Segmenting and extracting Manchu words from large Manchu document images is a key step in Manchu document analysis. This paper proposes a word segmentation and extraction method for Manchu document images based on seam carving. First, a projection profile matching strategy is used to smear the text coarsely and determine the number of text columns. Next, dynamic programming is run from bottom to top between adjacent text columns to find the minimum-energy line, and a midline region constraint yields the best segmentation line, one that does not cut through Manchu character components. Finally, individual Manchu text columns are extracted along the segmentation lines, and Manchu words are then extracted from them. Experimental results show that the method achieves good segmentation and extraction results on a Manchu document image database.
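The seam search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `min_energy_seam`, the parameters `center` and `half_width` (standing in for the midline region constraint), and the choice of energy map (any 2-D array that is larger where ink is present) are all assumptions made for illustration. The cost table is accumulated from the bottom row upward, and the seam is then traced from the cheapest cell in the top row.

```python
import numpy as np

def min_energy_seam(energy, center, half_width):
    """Return one column index per row for the cheapest vertical seam.

    energy: 2-D array (rows x cols), larger where ink is present (assumed).
    center, half_width: hypothetical midline band; the seam is confined to
    columns [center - half_width, center + half_width].
    """
    rows, cols = energy.shape
    lo = max(center - half_width, 0)
    hi = min(center + half_width, cols - 1)
    band = energy[:, lo:hi + 1].astype(float)
    width = band.shape[1]

    # Accumulate minimum path cost from the bottom row upward.
    cost = np.empty_like(band)
    cost[-1] = band[-1]
    for r in range(rows - 2, -1, -1):
        below = cost[r + 1]
        left = np.concatenate(([np.inf], below[:-1]))   # neighbour at j - 1
        right = np.concatenate((below[1:], [np.inf]))   # neighbour at j + 1
        cost[r] = band[r] + np.minimum(below, np.minimum(left, right))

    # Trace the seam from the cheapest top cell back down to the bottom.
    seam = np.empty(rows, dtype=int)
    seam[0] = int(np.argmin(cost[0]))
    for r in range(1, rows):
        c = seam[r - 1]
        a = max(c - 1, 0)
        b = min(c + 1, width - 1)
        seam[r] = a + int(np.argmin(cost[r][a:b + 1]))
    return seam + lo  # convert band offsets back to image columns

# Toy example: two ink columns with one stroke protruding into the gap.
energy = np.zeros((6, 7))
energy[:, :2] = 1.0
energy[:, 5:] = 1.0
energy[2, 3] = 1.0
seam = min_energy_seam(energy, center=3, half_width=1)
```

Confining the search to a band around the midline between two columns keeps the seam from drifting into a neighbouring column, while the three-neighbour step lets it bend around strokes that protrude into the gap.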
Key words
Manchu document images /
seam carving /
text column segmentation /
projection profile matching /
midline region constraint
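Funding
National Natural Science Foundation of China (61503058, 61702081); Natural Science Foundation of Liaoning Province (201602190); Scientific Research Project of the Education Department of Liaoning Province (L2015127); Liaoning Provincial Natural Science Foundation Guidance Program (201602205); Dalian Youth Science and Technology Star Project (2016RQ072)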