Fine-grained Chinese Medical Knowledge: A Representation System and an Annotated Corpus

YANG Yang, GUAN Yi, LI Xue, JIANG Jingchi, SHI Huaizhang, LIU Xiguang

Journal of Chinese Information Processing, 2023, Vol. 37, Issue 6: 52-66.
Language Resource Construction and Application


Abstract

To meet the demand for fine-grained, shareable, and highly accurate medical knowledge, we propose a knowledge representation system for Chinese medical text that integrates knowledge from three data sources: electronic medical records, medical books, and professional medical websites. The system defines 9 entity types and 60 entity relation types. Based on it, we develop an easy-to-use annotation tool, provide annotated text that follows a common standard for each source, and construct a publicly available fine-grained annotated corpus with wide coverage and high consistency. Four clinicians annotated the medical textbook “Diagnostics”, producing 6 526 medical entities and 4 229 entity relations with an inter-annotator agreement of up to 0.974. After the three data sources are merged, the corpus contains 344 475 medical entities and 3 196 787 entity relations without duplicates. The paper describes the mapping scheme and annotation rules used to fuse the data sources, analyzes the text characteristics of each source, summarizes the annotation patterns, and uses application scenarios and text characteristics to show why annotating medical books is necessary. As a pioneering work, it provides an annotation standard for Chinese medical corpus construction and corpus support for Chinese medical entity recognition and relation extraction.

Key words

fine-grained annotation standard / multi-source medical text / semantic annotation / corpus construction

Cite this article

YANG Yang, GUAN Yi, LI Xue, JIANG Jingchi, SHI Huaizhang, LIU Xiguang. Fine-grained Chinese Medical Knowledge: A Representation System and an Annotated Corpus. Journal of Chinese Information Processing. 2023, 37(6): 52-66

Funding

National Natural Science Foundation of China (62006063); Heilongjiang Provincial Postdoctoral Science Foundation (LBH-Z20015)