LI Chunnan, WANG Lei, SUN Yuanyuan, LIN Hongfei. BERT Based Named Entity Recognition for Legal Texts on Theft Cases[J]. Journal of Chinese Information Processing, 2021, 35(8): 73-81.
BERT Based Named Entity Recognition for Legal Texts on Theft Cases
LI Chunnan1, WANG Lei2, SUN Yuanyuan1, LIN Hongfei1
1.School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China; 2.People's Procuratorate of Jinzhou, Jinzhou, Liaoning 121000, China
Abstract: Named entity recognition for legal texts is a key, foundational task in the field of smart judiciary. Existing approaches suffer from entity definitions that are poorly aligned with judicial practice, and from traditional word vectors that cannot resolve polysemy. To address these problems, this paper proposes a new entity definition scheme for legal texts and constructs LegalCorpus, a named entity corpus built from letters of proposal for prosecution. It further proposes a named entity recognition method for legal texts based on BERT-ON-LSTM-CRF (Bidirectional Encoder Representations from Transformers - Ordered Neurons Long Short-Term Memory - Conditional Random Field). The method first uses the pre-trained language model BERT to dynamically generate a semantic vector for each character according to its context; an ON-LSTM then models the input sequence and its latent hierarchy to extract text features; finally, a CRF decodes the optimal tag sequence. On LegalCorpus, the proposed method achieves an F1 score of 86.09%, a 7.8% improvement over the best baseline, lattice LSTM. The experimental results show that the method can effectively recognize named entities in legal texts.