ZHANG Zhixing, ZHANG Jiaying, GAO Daqi, RUAN Tong, WANG Jun, HE Ping, YAO Huayan. Construction of Clinic Indicator Terminology Base and Its Application in Medical Record Mining[J]. Journal of Chinese Information Processing, 2020, 34(12): 100-110.
Construction of Clinic Indicator Terminology Base and Its Application in Medical Record Mining
ZHANG Zhixing¹, ZHANG Jiaying¹, GAO Daqi¹, RUAN Tong¹, WANG Jun², HE Ping³, YAO Huayan⁴
1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China; 2. SimMed, Shanghai 200436, China; 3. Shanghai Hospital Development Center, Shanghai 200041, China; 4. Ruijin Hospital, School of Medicine, Shanghai JiaoTong University, Shanghai 200025, China
Abstract: On the Shanghai Regional Health Platform, which holds electronic medical record data from 38 tertiary hospitals, the diversity and ambiguity of clinical indicators have seriously hindered medical data mining. In this paper, we propose a semi-automatic solution for constructing a terminology base in four steps: schema design, information extraction, knowledge fusion, and knowledge verification. We first build a standard indicator sub-base according to the medical insurance standard provided by the Shanghai Municipal Health Commission. We then use a BERT-based clinical indicator alignment model to integrate the indicators from the 38 hospitals into the standard as synonyms. The constructed terminology base contains 23,495 entities and 47,746 factual triples, with potential applications in medical data cleaning, medical record retrieval, and other tasks. Experiments show that the F1-score of our alignment model reaches 95.78%, and that its application in a colorectal cancer data mining task can improve the record up to 94%. In addition, a part of this terminology base related to colorectal cancer has been published at dcazb.ecustnlplab.com.
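As a rough illustration of how a terminology base of this kind might support data cleaning, the sketch below stores standard indicators and their synonyms as triples and maps hospital-specific names onto the standard name. It is a minimal sketch, not the authors' implementation: the predicate names (`rdf:type`, `hasSynonym`), the indicator names, and the sample records are all hypothetical, and the synonym links are assumed to have already been produced by an alignment step such as the BERT-based model described in the abstract.

```python
# Hypothetical triples (subject, predicate, object). The indicator names and
# predicates are invented for illustration and are not from the paper.
TRIPLES = [
    ("white blood cell count", "rdf:type", "StandardIndicator"),
    ("white blood cell count", "hasSynonym", "WBC"),
    ("white blood cell count", "hasSynonym", "leukocyte count"),
    ("hemoglobin", "rdf:type", "StandardIndicator"),
    ("hemoglobin", "hasSynonym", "HGB"),
    ("hemoglobin", "hasSynonym", "Hb"),
]


def build_synonym_index(triples):
    """Map every known surface form (lower-cased) to its standard indicator."""
    index = {}
    for subject, predicate, obj in triples:
        if predicate == "hasSynonym":
            # A hospital-specific name points to the standard indicator.
            index[obj.strip().lower()] = subject
        elif predicate == "rdf:type":
            # The standard name also maps to itself.
            index[subject.strip().lower()] = subject
    return index


def normalize(name, index):
    """Return the standard indicator for a raw name, or None if unknown."""
    return index.get(name.strip().lower())


index = build_synonym_index(TRIPLES)

# Hypothetical raw records: (hospital, indicator name as recorded, value).
records = [
    ("Hospital A", "WBC", 6.1),
    ("Hospital B", "Leukocyte count", 5.8),
    ("Hospital C", "Hb", 135.0),
]

# Data cleaning: rewrite each record to use the standard indicator name.
cleaned = [(hospital, normalize(name, index), value)
           for hospital, name, value in records]

for row in cleaned:
    print(row)
```

Names that are not in the index come back as `None`, which in a semi-automatic workflow could be routed to manual review rather than silently dropped.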