Abstract
Medical entity standardization aims to map non-standardized terms in text (e.g., electronic medical records and patient chief complaints) onto unified, standardized medical entities. To address the small scale and low degree of normalization of annotated corpora in the medical domain, this paper proposes a multi-model collaborative ensemble learning framework for medical entity standardization. By establishing a "cooperation and competition" mechanism among multiple models, the framework combines the advantages of different standardization methods at the character and semantic levels. Specifically, collaborative learning implemented via knowledge distillation extracts effective features from each model, while competition-aware integration of each model's standardization results preserves the diversity of the candidate set. On the CHIP-CDN 2021 medical entity standardization task, the proposed method achieved an F1 score of 73.985% on the blind test set, ranking second among 255 teams, including Baidu BDKG, Ant Group Antins, and AISpeech. Further experiments show that the method effectively standardizes terms in medical text.
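The two mechanisms named above — knowledge-distillation-based collaboration and competition-aware merging of per-model candidates — can be illustrated with a minimal sketch. This is not the authors' implementation: the temperature value, the vote-then-best-rank tie-breaking, and the function names are illustrative assumptions; only the general techniques (Hinton-style distillation loss, union of ranked candidate lists) come from the abstract.

```python
import numpy as np
from collections import defaultdict

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher T yields a softer distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
    return temperature ** 2 * kl

def merge_candidates(per_model_topk):
    """Competition-aware merge (one plausible scheme): take the union of each
    model's ranked candidate list, order by how many models proposed the
    candidate, breaking ties by the best rank any model assigned it."""
    stats = defaultdict(lambda: [0, float("inf")])  # candidate -> [votes, best_rank]
    for ranked in per_model_topk:
        for rank, cand in enumerate(ranked):
            stats[cand][0] += 1
            stats[cand][1] = min(stats[cand][1], rank)
    return sorted(stats, key=lambda c: (-stats[c][0], stats[c][1]))
```

For example, merging the top-2 lists `["A", "B"]` and `["B", "C"]` from two models keeps all three candidates (preserving diversity) while ranking `"B"` first, since both models proposed it.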
Key words
medical entity standardization /
knowledge distillation /
ensemble learning /
CHIP-CDN 2021
Funding
National Natural Science Foundation of China Youth Program (NSFC62006063); Heilongjiang Provincial Postdoctoral Science Foundation General Program (LBH-Z20015)