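Corpus Construction for Named-Entity and Entity Relations for Electronic Medical Records of Stroke Disease

CHANG Hongyang, ZAN Hongying, MA Yutuan, ZHANG Kunli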

Journal of Chinese Information Processing, 2022, Vol. 36, Issue 8: 37-45.
Language Resource Construction and Application

Abstract

This paper addresses the annotation of entities and entity relations in Chinese electronic medical records of stroke disease, and proposes an annotation scheme and guidelines suited to the content and characteristics of such records. Guided by the scheme and guidelines, multiple rounds of manual annotation and correction were carried out, covering more than 1.58 million characters of stroke electronic medical record text. The result is the Stroke Electronic Medical Record entity and entity related Corpus (SEMRC), which contains 10,594 named entities and 14,457 entity relations (the journal's English abstract gives 14,597). Annotation agreement reached 85.16% for entity names and 94.16% for entity relations.
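
The abstract reports agreement rates but does not say how they were computed. As a rough illustration only, one common choice for span annotation is exact-match pairwise F1, treating one annotator as the reference. The sketch below assumes that metric; the span representation, the labels (`Disease`, `Symptom`, `Drug`), and the toy data are hypothetical and are not taken from SEMRC.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_offset, end_offset, label).
Span = Tuple[int, int, str]


def pairwise_f1(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """Exact-match F1 between two annotators, treating A as the reference.

    Assumption: agreement is measured as F1; the source does not confirm this.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # both annotators marked nothing, so they fully agree
    matched = len(annotator_a & annotator_b)  # spans identical in offsets and label
    if matched == 0:
        return 0.0
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
a = {(0, 3, "Disease"), (5, 9, "Symptom"), (12, 15, "Drug")}
b = {(0, 3, "Disease"), (5, 9, "Symptom"), (12, 16, "Drug")}
print(round(pairwise_f1(a, b), 4))  # 0.6667: two of three spans match exactly
```

Under exact matching, a boundary disagreement of a single character counts as a full miss, which is one reason entity agreement can come out lower than relation agreement when relations are annotated over already-agreed entities.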

Key words

stroke disease / corpus construction / named entity / entity relations

Cite this article

CHANG Hongyang, ZAN Hongying, MA Yutuan, ZHANG Kunli. Corpus Construction for Named-Entity and Entity Relations for Electronic Medical Records of Stroke Disease. Journal of Chinese Information Processing. 2022, 36(8): 37-45

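Funding

Henan Province Medical Science and Technology Research Program, Province-Ministry Co-construction Project (SB201901021); Key Scientific Research Project of Higher Education Institutions of Henan Province (20A520038); Zhengzhou Collaborative Innovation Major Special Project for Science and Technology Research (20XTZX11020)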