Construction of Tibetan Event Dataset Oriented to the Judicial Field

GAO Lu, ZHAO Xiaobing

Journal of Chinese Information Processing ›› 2023, Vol. 37 ›› Issue (8) : 34-42,51.
Information Processing for Ethnic, Cross-Border, and Neighboring Languages

Abstract

To build a high-quality Tibetan judicial event dataset and to support further exploration, evaluation, and optimization of Tibetan judicial event extraction, this paper targets the Tibetan judicial domain and takes Tibetan criminal judgment documents as its object of study. It designs a two-stage event schema, "category grouping-theme modeling", that fits Tibetan judicial practice. A semi-automatic annotation strategy is applied that combines model-driven pre-annotation of event trigger words with manual annotation of event elements: 1,863 Tibetan criminal judgment documents were crawled, processed with OCR, denoised, analyzed, collaboratively annotated by multiple annotators, and reviewed, yielding the Tibetan judicial event dataset TiEvent. TiEvent defines event types in 3 major categories and 12 subcategories and covers 2,249 event mentions drawn from 1,863 real Tibetan judicial texts. Experiments with the event extraction models BiLSTM, BiLSTM-CRF, and CINO-CRF show that the highest F1 values for Tibetan judicial event trigger detection and argument recognition are 75.36% and 70.98%, respectively. On Tibetan judicial texts, TiEvent offers high event coverage and event-element completeness, and can meet the basic needs of Tibetan judicial event extraction.
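The F1 values quoted in the abstract are the harmonic mean of precision and recall. The sketch below shows one common way such a score is computed for trigger detection: micro-averaged, exact-match scoring over labeled spans. The abstract does not state the evaluation protocol, so the matching criterion, the `Span` layout, and the event type names ("Theft", "Arrest", "Sentence") are assumptions made for illustration only and are not taken from TiEvent.

```python
from typing import Set, Tuple

# Hypothetical span layout: (document id, start token, end token, event type).
Span = Tuple[str, int, int, str]


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1 over labeled spans."""
    # A prediction counts as correct only if its boundaries and type both match.
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data with made-up event types, not drawn from the dataset.
gold = {
    ("doc1", 3, 4, "Theft"),
    ("doc1", 10, 11, "Arrest"),
    ("doc2", 5, 6, "Sentence"),
}
pred = {
    ("doc1", 3, 4, "Theft"),        # correct
    ("doc1", 10, 11, "Sentence"),   # right span, wrong type
    ("doc2", 5, 6, "Sentence"),     # correct
    ("doc2", 8, 9, "Arrest"),       # spurious
}

p, r, f = precision_recall_f1(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

With two of four predictions correct against three gold triggers, precision is 0.5, recall is 2/3, and F1 is 4/7 (about 0.5714). The same function could score argument recognition if each span also carried its role label.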

Key words

judicial dataset / event extraction / Tibetan information processing

Cite this article

GAO Lu, ZHAO Xiaobing. Construction of Tibetan Event Dataset Oriented to the Judicial Field. Journal of Chinese Information Processing. 2023, 37(8): 34-42,51

Funding

National Social Science Fund of China (22&ZD035)