Abstract
The classification schemes for entities and entity relations in current medical corpora fall short of what the development of precision medicine requires. This paper addresses the problem for pediatric diseases. Under the guidance of medical domain experts, it defines an annotation scheme and detailed annotation guidelines for named entities and entity relations suited to pediatrics. Drawing on relevant Chinese and international medical standard resources, annotation tools are used for machine pre-annotation, manual annotation, and manual proofreading of entities and entity relations in more than 2.98 million characters of pediatric medical text, yielding a medical entity and relation corpus for pediatric diseases. The corpus covers 504 common pediatric diseases and contains 23 603 annotated named entities and 36 513 annotated entity relations, with multi-round annotation agreement of 0.85 and 0.82, respectively. Based on this corpus, the paper also constructs a pediatric medical knowledge graph and develops a knowledge-graph-based pediatric medical question answering system.
Key words
pediatrics /
corpus construction /
named entity /
entity relation /
knowledge graph
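Funding
National Social Science Fund of China (18ZDA315); Key Scientific Research Project of Higher Education Institutions of Henan Province (20A520038); Henan Province Science and Technology Research Project (192102210260); International Cooperation Project of the Henan Province Science and Technology Research Program (172102410065); Province-Ministry Co-construction Project of the Henan Province Medical Science and Technology Research Program (SB201901021)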