Abstract
Entity names in judicial case documents tend to be long and strongly correlated with one another. To exploit these characteristics, this paper proposes a named entity recognition method for legal documents based on a forward maximum matching strategy and a community attention mechanism (FMM-CAM). The forward maximum matching strategy gives priority to the longer matching words for each character in a legal document. Each matching word is assigned to one of four matching-word communities, B, M, E, or S, according to the position of the character within the word, and a community self-attention mechanism captures the relevance weights among the matching words. Concretely, character representations from BERT and Word2Vec are concatenated with the matching-word vectors compressed from each community and fed into a BiLSTM to obtain the semantic representation of the sentence; a CRF layer then decodes the sentence into the optimal tag sequence. Experimental results show that the proposed method effectively determines the boundaries of entities in legal documents, such as evidence names, attested content, and case file numbers.
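The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_len` cap, and the toy lexicon are all assumptions made for illustration, and the sketch keeps only one matching word per character (the one chosen by greedy forward maximum matching), whereas the paper's method may retain several. The attention, BiLSTM, and CRF stages are not shown.

```python
from typing import Dict, List, Set, Tuple


def forward_maximum_match(
    sentence: str, lexicon: Set[str], max_len: int = 8
) -> List[Tuple[int, int]]:
    """Greedy left-to-right segmentation; returns (start, end) spans, end exclusive."""
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(sentence):
        match_end = i + 1  # fall back to a single character if nothing matches
        # Try the longest candidate first, shrinking down to length 2.
        for j in range(min(len(sentence), i + max_len), i + 1, -1):
            if sentence[i:j] in lexicon:
                match_end = j
                break
        spans.append((i, match_end))
        i = match_end
    return spans


def assign_communities(
    sentence: str, spans: List[Tuple[int, int]]
) -> List[Dict[str, List[str]]]:
    """For each character, place its matched word in the B, M, E, or S community."""
    per_char: List[Dict[str, List[str]]] = []
    for start, end in spans:
        word = sentence[start:end]
        for k in range(start, end):
            communities: Dict[str, List[str]] = {"B": [], "M": [], "E": [], "S": []}
            if end - start == 1:
                tag = "S"  # single-character word
            elif k == start:
                tag = "B"  # character begins the word
            elif k == end - 1:
                tag = "E"  # character ends the word
            else:
                tag = "M"  # character is in the middle of the word
            communities[tag].append(word)
            per_char.append(communities)
    return per_char


# Hypothetical toy lexicon and sentence, for illustration only.
lexicon = {"证人", "证人证言", "证实", "被告", "被告人"}
sentence = "证人证言证实被告人"
spans = forward_maximum_match(sentence, lexicon)
communities = assign_communities(sentence, spans)
```

In this toy case the longer entry "证人证言" is preferred over its prefix "证人", which is the behavior the abstract attributes to the matching strategy. In a full model, the words in each community would be embedded and weighted by self-attention before being concatenated with the character representation.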
Key words
legal instruments /
named entity recognition /
self-attention /
BiLSTM
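Funding
National Natural Science Foundation of China (62076158, 62106130, 62072294); Shanxi Province Graduate Innovation Project (2021Y160); Shanxi Province Key Research and Development Program (201803D421024); Shanxi Province Basic Research Program (20210302124084)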