Abstract
Financial element extraction applies information extraction techniques to contracts, prospectuses, and other financial documents to pull out the entities and phrases, called financial elements, that capture a document's key information, ultimately enabling automated document processing. Compared with existing extraction tasks, financial element extraction faces long-tailed sample distributions, fine-grained element types, and long documents with long elements, and existing extraction models handle such complex settings poorly. This paper proposes ENAPtBERT, which reformulates element extraction as a typed head/tail pointer prediction task. On the one hand, the head/tail pointer design alleviates the effect of illegal label sequences and combines naturally with imbalance-aware loss functions to mitigate class imbalance. On the other hand, ENAPtBERT exploits element name information to improve the accuracy of both locating and classifying elements. On a financial element extraction dataset, ENAPtBERT improves Micro-F1 by 2.50% and Macro-F1 by at least 2.66% over existing extraction models, demonstrating its effectiveness on this complex extraction problem.
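To make the head/tail pointer formulation concrete, the sketch below (not from the paper; a minimal, dependency-free illustration) shows the two ingredients the abstract describes: threshold-based decoding of typed spans from per-token head and tail probabilities, and a focal-style loss that down-weights easy negatives to cope with label imbalance. The function names `decode_spans` and `focal_loss`, the nearest-tail pairing rule, and the threshold value are all illustrative assumptions, not ENAPtBERT's actual implementation.

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss for one prediction. Easy, confident examples are
    down-weighted by (1 - p_t)^gamma, which helps when positive head/tail
    labels are rare relative to negatives (the imbalance problem)."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(max(p_t, 1e-12))

def decode_spans(head_probs, tail_probs, threshold=0.5, max_len=10):
    """Decode typed spans from per-token head/tail probabilities.

    head_probs / tail_probs: dict mapping element type -> list of
    probabilities, one per token. A span (i, j) of a given type is emitted
    when token i is a predicted head and token j >= i is the nearest
    predicted tail within max_len tokens. Because heads and tails are
    predicted independently per type, no illegal BIO-style transitions
    can arise.
    """
    spans = []
    for etype, heads in head_probs.items():
        tails = tail_probs[etype]
        for i, hp in enumerate(heads):
            if hp < threshold:
                continue
            # pair each head with the nearest tail at or after it
            for j in range(i, min(i + max_len, len(tails))):
                if tails[j] >= threshold:
                    spans.append((etype, i, j))
                    break
    return spans

# Toy example: a 4-token sentence with one "amount" element spanning tokens 0-2.
heads = {"amount": [0.9, 0.1, 0.2, 0.1]}
tails = {"amount": [0.1, 0.2, 0.8, 0.1]}
print(decode_spans(heads, tails))  # [('amount', 0, 2)]
```

A confidently correct head prediction (e.g. `focal_loss(0.9, 1)`) contributes far less loss than an uncertain one (`focal_loss(0.6, 1)`), which is how the focal weighting keeps abundant easy negatives from dominating training.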
Key words
financial element extraction /
imbalance /
fine-grained /
element name information
Funding
National Natural Science Foundation of China (61872113)