A Survey on Non-Relational Table Understanding


Journal of Chinese Information Processing, 2024, Vol. 38, Issue 5: 1-21.

Survey


LUO Ping^1,2,3, YANG Qingping^1,2, CAO Yixuan^1,2, CAO Rongyu^1,2, HE Qing^1,2


Abstract

Table understanding is the process of automatically recognizing, parsing, and applying tables, which are widespread on the Internet and in vertical domains. Tables can be broadly divided into relational and non-relational tables. Relational tables resemble relational database tables: their structure is fixed and easy for machines to parse, and they have a long research history. Non-relational tables usually have variable layouts and flexible syntax and show more pronounced linguistic characteristics, which makes them very challenging for computers to parse and apply. Non-relational table understanding is an important emerging area at the intersection of natural language processing and computer vision. With the spread of deep learning in recent years, it has advanced considerably in three directions: table recognition, semantic analysis, and innovative applications. This paper introduces the structural characteristics of non-relational tables and the unique challenges they pose for research, then reviews recent developments in the three directions above. It also summarizes the related datasets, identifies the open problems in non-relational table understanding, and outlines possible future research directions.

Cite this article

LUO Ping, YANG Qingping, CAO Yixuan, CAO Rongyu, HE Qing. A Survey on Non-Relational Table Understanding. Journal of Chinese Information Processing. 2024, 38(5): 1-21


Funding

National Natural Science Foundation of China (62076231, U1811461, 62206265); China Postdoctoral Science Foundation (2021M703271)


Published: 2024-06-24
Issue Date: 2024-06-26