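Electronic medical records (EMRs) are an important source of medical information and contain a large amount of medical domain knowledge. Starting from the text of diabetes EMRs, and following a survey of existing EMR corpora in China and abroad, this paper draws on the I2B2 entity and relation categories to establish a classification scheme for entities and entity relations in diabetes EMRs, together with annotation guidelines. Using an entity and relation annotation platform, we carried out entity and relation pre-annotation followed by multiple rounds of manual proofreading, producing the Diabetes Electronic Medical Record entity and relation Corpus (DEMRC). DEMRC contains 8,899 entities, 456 entity modifiers, and 16,564 relations. A consistency evaluation and analysis shows that annotation agreement reaches 0.8542 for entities and 0.9416 for relations. For the entity recognition and entity relation extraction tasks, preliminary experiments were run with a transfer-learning-based BiLSTM-CRF model and a RoBERTa model, respectively, and each entity and relation type in the corpus was evaluated. This lays a foundation for subsequent research on entity recognition and relation extraction in diabetes EMRs and for the construction of a diabetes knowledge graph.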
Abstract
Electronic medical records (EMRs) are an important source of medical information and are rich in medical knowledge. Drawing on the I2B2 entity and relation classification, we establish a classification system for the entities and entity relations of diabetes EMRs, together with an annotation scheme. After multiple rounds of manual proofreading, the Diabetes Electronic Medical Record entity and relation Corpus (DEMRC) is completed. DEMRC contains 8,899 entities, 456 entity modifiers, and 16,564 relations. Annotation consistency reaches 0.8542 for entities and 0.9416 for relations. For the entity recognition and entity relation extraction tasks, a transfer-learning-based BiLSTM-CRF model and a RoBERTa model, respectively, are trained on the corpus for preliminary experiments, and the individual entity and relation types in the corpus are evaluated. This lays a foundation for follow-up research on entity recognition and relation extraction in diabetes EMRs and for the construction of a diabetes knowledge graph.
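The abstract does not say which agreement measure produced the reported consistency scores. As an illustration only, the sketch below computes one measure commonly used for entity annotation: the exact-match F1 between two annotators' span sets. The span representation, the entity type labels, and the function name `pairwise_f1` are all assumptions made for this example, not details taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_offset, end_offset, entity_type).
Span = Tuple[int, int, str]


def pairwise_f1(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """Exact-match F1 between two annotators, treating A as the reference."""
    if not annotator_a and not annotator_b:
        return 1.0  # two empty annotations agree trivially
    matched = len(annotator_a & annotator_b)
    if matched == 0:
        return 0.0
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Toy annotations with made-up offsets and labels.
a = {(0, 3, "Disease"), (10, 14, "Drug"), (20, 25, "Symptom")}
b = {(0, 3, "Disease"), (10, 14, "Drug"), (21, 25, "Symptom"), (30, 33, "Test")}

score = pairwise_f1(a, b)
print(f"agreement F1 = {score:.4f}")
```

Under this definition the score is symmetric in the two annotators, so the choice of reference does not change the result. A span counts as agreed only when both its boundaries and its type match exactly.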
Key words
diabetes /
electronic medical records /
entity and relation annotation system /
corpus construction
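Funding
China Postdoctoral Science Foundation (2020M682349); Henan Province Science and Technology Research Project (232102211033); Henan Province Medical Science and Technology Research Program, Provincial-Ministerial Co-construction Project (SB201901021); Key Scientific Research Project of Higher Education Institutions of Henan Province (19A520003, 20A520038); Humanities and Social Sciences Planning Project of the Ministry of Education (20YJA740033); Henan Province Philosophy and Social Science Planning Project (2019BYY016)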