Abstract
Medical entity standardization aims to map non-standardized terms in text (e.g., electronic medical records and patient chief complaints) onto unified, standardized medical entities. To address the small scale and low degree of normalization of annotated corpora in the medical domain, this paper proposes a multi-model collaborative ensemble learning framework for medical entity standardization. By establishing a "cooperation and competition" mechanism among multiple models, the framework combines the advantages of different standardization methods at the character and semantic levels. Specifically, collaborative learning implemented via knowledge distillation extracts effective features from each model, while competition-aware integration of each model's standardization results preserves the diversity of the candidate set. On the CHIP-CDN 2021 medical entity standardization task, the proposed method achieved an F1 score of 73.985% on the blind test set, ranking second among 255 teams, including Baidu BDKG, Ant Group Antins, and AISpeech. Further experiments show that the method effectively standardizes terms in medical text.
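The two mechanisms named above — knowledge-distillation-based collaboration and competition-aware merging of per-model candidates — can be illustrated with a minimal sketch. This is not the authors' implementation: the temperature value, the vote-then-best-rank tie-breaking, and the function names are illustrative assumptions; only the general techniques (Hinton-style distillation loss, union of ranked candidate lists) come from the abstract.

```python
import numpy as np
from collections import defaultdict

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher T yields a softer distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
    return temperature ** 2 * kl

def merge_candidates(per_model_topk):
    """Competition-aware merge (one plausible scheme): take the union of each
    model's ranked candidate list, order by how many models proposed the
    candidate, breaking ties by the best rank any model assigned it."""
    stats = defaultdict(lambda: [0, float("inf")])  # candidate -> [votes, best_rank]
    for ranked in per_model_topk:
        for rank, cand in enumerate(ranked):
            stats[cand][0] += 1
            stats[cand][1] = min(stats[cand][1], rank)
    return sorted(stats, key=lambda c: (-stats[c][0], stats[c][1]))
```

For example, merging the top-2 lists `["A", "B"]` and `["B", "C"]` from two models keeps all three candidates (preserving diversity) while ranking `"B"` first, since both models proposed it.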
Key words
medical entity standardization /
knowledge distillation /
ensemble learning /
CHIP-CDN 2021
Funding
National Natural Science Foundation of China Youth Program (NSFC62006063); Heilongjiang Provincial Postdoctoral Science Foundation General Program (LBH-Z20015)