Journal of Chinese Information Processing ›› 2023, Vol. 37 ›› Issue (12): 1-16.
Survey

Survey on Visual Document Information Extraction Via Deep Learning Models

  • WU Boxin1, ZHONG Guoqiang1, MA Longlong2

Abstract

Visual documents endow text with rich multimodal features, such as visual, textual, and layout features. Visual document information extraction aims to exploit these multimodal features to extract structured key information from document content. It has gradually become an important interdisciplinary field between natural language processing and computer vision, with wide applications in business, medicine, education, and other domains. Driven by recent advances and breakthroughs in deep learning, visual document information extraction has developed rapidly, and existing methods can be roughly divided into two categories: supervised learning methods (including graph-based, grid-based, and end-to-end methods) and methods based on self-supervised pre-training followed by supervised fine-tuning, the latter of which has gradually become the mainstream research direction. This paper reviews the three types of supervised methods, four aspects of the pre-training-based methods, and several commonly used public datasets, and finally summarizes the field and discusses possible future research directions.
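
To make the pre-training-then-fine-tuning paradigm mentioned above concrete, the sketch below frames key information extraction as BIO token tagging over OCR words and their layout coordinates, using the HuggingFace transformers implementation of LayoutLM. This is a minimal illustration, not code from the survey: the label set, example words, and bounding boxes are hypothetical.

```python
# Minimal sketch: visual document information extraction as BIO token tagging
# with a pre-trained layout-aware model (LayoutLM via HuggingFace transformers).
# The labels, words, and boxes below are illustrative placeholders.
import torch
from transformers import LayoutLMForTokenClassification, LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    num_labels=5,  # e.g. O, B-KEY, I-KEY, B-VALUE, I-VALUE
)

# OCR output: words plus bounding boxes normalized to LayoutLM's 0-1000 grid.
words = ["Invoice", "Total:", "$12.30"]
boxes = [[57, 50, 160, 68], [57, 92, 118, 110], [130, 92, 190, 110]]

# LayoutLM (v1) has no processor that aligns boxes to subwords, so each
# word's box is repeated for every subword token it produces.
tokens, token_boxes = [], []
for word, box in zip(words, boxes):
    subwords = tokenizer.tokenize(word)
    tokens.extend(subwords)
    token_boxes.extend([box] * len(subwords))

input_ids = (
    [tokenizer.cls_token_id]
    + tokenizer.convert_tokens_to_ids(tokens)
    + [tokenizer.sep_token_id]
)
# Conventional dummy boxes for the [CLS] and [SEP] special tokens.
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

input_ids = torch.tensor([input_ids])
bbox = torch.tensor([token_boxes])
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask)
pred_label_ids = outputs.logits.argmax(-1)  # one BIO label id per subword token
```

In the fine-tuning setting the survey describes, the same forward pass is run with a `labels` tensor so the token-classification head and the pre-trained backbone are optimized jointly on an annotated dataset such as FUNSD or SROIE.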

Key words

visual document information extraction / multi-modal / pre-training / deep learning

Cite this article

WU Boxin, ZHONG Guoqiang, MA Longlong. Survey on Visual Document Information Extraction Via Deep Learning Models. Journal of Chinese Information Processing, 2023, 37(12): 1-16.

Funding

"New Generation Artificial Intelligence" Major Project (2018AAA0100400); Natural Science Foundation of Shandong Province (ZR2020MF131); Major Basic Research Project of Shandong Province (ZR2021ZD19); Qingdao Science and Technology Plan Project (21-1-4-ny-19-nsh)