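Conquering Unlabeled Entity in Medical Record Text under Distant Supervision Framework

YANG Yifan, SHI Miaoyuan, MIAO Qingliang, LI Maolong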

Journal of Chinese Information Processing, 2022, Vol. 36, Issue 8: 73-80.
Section: Information Extraction and Text Mining

Abstract

Healthcare is a perennial topic of public interest, and automatic information extraction from medical record text is becoming increasingly important. Manual annotation of medical data is costly, so large-scale labeled corpora are hard to obtain. One remedy for the shortage of labeled data is dictionary-based distant supervision, but the annotations it produces are of low quality, and model performance degrades severely as a result. This paper studies how to alleviate the unlabeled-entity (missed annotation) problem that distant supervision introduces. It augments the training data, combines a span-based named entity recognition model with negative sampling to improve generalization, and selects a globally optimal node set to resolve conflicts between recognized entities. Experiments show that data augmentation and globally optimal node set selection each yield a stable improvement of about 0.5%, while negative sampling yields an improvement of 5% to 10%.
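The abstract names its ingredients without giving their details. As a rough illustration only, the Python sketch below shows how dictionary-based distant supervision can miss an entity, and how span-level negative sampling trains on just a random subset of the unlabeled spans rather than treating all of them as non-entities. The function names `distant_label` and `sample_negative_spans`, the toy lexicon, the greedy longest-match strategy, and the `ratio` and `max_len` parameters are all assumptions made for this example and are not taken from the paper.

```python
import random


def distant_label(tokens, lexicon, max_len=4):
    """Label spans by greedy longest match against a lexicon (assumed strategy).

    Returns a dict mapping (start, end) token spans, end exclusive, to a type.
    """
    spans = {}
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + length])
            if phrase in lexicon:
                spans[(i, i + length)] = lexicon[phrase]
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return spans


def sample_negative_spans(n_tokens, positives, ratio=0.35, max_len=4, rng=None):
    """Sample a small subset of unlabeled spans to use as negatives.

    Treating every unlabeled span as a negative would also penalize entities
    the lexicon missed; sampling only a few makes that less likely.
    """
    rng = rng or random.Random(0)
    candidates = [
        (start, end)
        for start in range(n_tokens)
        for end in range(start + 1, min(start + max_len, n_tokens) + 1)
        if (start, end) not in positives
    ]
    k = min(len(candidates), max(1, int(ratio * n_tokens)))
    return rng.sample(candidates, k)


# Toy sentence: the lexicon knows "hypertension" but not "type 2 diabetes",
# so the second entity is silently left unlabeled (the missed-annotation case).
tokens = ["patient", "has", "hypertension", "and", "type", "2", "diabetes"]
lexicon = {"hypertension": "DISEASE"}

positives = distant_label(tokens, lexicon)
negatives = sample_negative_spans(len(tokens), positives)
print("positives:", positives)
print("sampled negatives:", negatives)
```

In this toy setup the span covering "type 2 diabetes" is one unlabeled candidate among many, so a small random sample of negatives is unlikely to include it, whereas labeling every unlabeled span as a non-entity would always penalize it.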

Keywords

named entity recognition / distant supervision / data omission / data augmentation / negative sampling

Cite this article

YANG Yifan, SHI Miaoyuan, MIAO Qingliang, LI Maolong. Conquering Unlabeled Entity in Medical Record Text under Distant Supervision Framework. Journal of Chinese Information Processing. 2022, 36(8): 73-80
