NLP Application
ZHANG Zhixing, ZHANG Jiaying, GAO Daqi, RUAN Tong,
WANG Jun, HE Ping, YAO Huayan
2020, 34(12): 100-110.
On the Shanghai Regional Health Platform, which holds electronic medical record data from 38 tertiary hospitals, the diversity and ambiguity of clinical indicators have seriously hindered medical data mining. In this paper, we propose a semi-automatic solution for constructing a terminology base in four steps: schema design, information extraction, knowledge fusion, and knowledge verification. We first build a standard indicator sub-base according to the medical insurance standard provided by the Shanghai Municipal Health Commission. We then use a BERT-based clinical indicator alignment model to integrate the indicators from the 38 hospitals into the standard as synonyms. The constructed terminology base contains 23,495 entities and 47,746 factual triples, with potential applications in medical data cleaning, medical record retrieval, and other tasks. Experiments show that the F1-score of our alignment model reaches 95.78%, and that applying it to a colorectal cancer data mining task can improve the record up to 94%. In addition, the part of this terminology base related to colorectal cancer has been published at dcazb.ecustnlplab.com.
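The knowledge fusion step, in which hospital-specific indicator names are attached to standard indicators as synonyms, can be illustrated with a minimal sketch. It is not the authors' method: the BERT-based alignment model is replaced here by a plain character-level similarity from Python's standard `difflib`, and the indicator names, the `has_synonym` relation, the threshold, and the function names are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical standard indicator names (illustrative, not from the paper).
STANDARD = ["hemoglobin", "white blood cell count", "carcinoembryonic antigen"]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a learned model score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(hospital_name, standard=STANDARD, threshold=0.6):
    """Return the best-matching standard indicator, or None if below the threshold."""
    best = max(standard, key=lambda s: similarity(hospital_name, s))
    return best if similarity(hospital_name, best) >= threshold else None


def fuse(hospital_names, standard=STANDARD, threshold=0.6):
    """Build (standard, 'has_synonym', hospital_name) triples.

    Names that match no standard indicator are returned separately so they
    can be passed to manual verification.
    """
    triples, unmatched = [], []
    for name in hospital_names:
        target = align(name, standard, threshold)
        if target is None:
            unmatched.append(name)
        elif name != target:
            # The hypothetical relation name 'has_synonym' links the two entities.
            triples.append((target, "has_synonym", name))
    return triples, unmatched


triples, unmatched = fuse(
    ["Hemoglobin (Hb)", "White blood cell count (WBC)", "serum potassium"]
)
print(triples)
print(unmatched)
```

Surface similarity of this kind only catches near-identical spellings. Abbreviations or differently worded names for the same indicator are the cases where a learned alignment model, such as the BERT-based one described above, would be needed.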