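Research on entity recognition in ancient traditional Chinese medicine (TCM) books remains scarce, and most of it relies on supervised learning. However, these books are poorly digitized, annotated corpora are scarce, the texts are mostly written in classical Chinese, and the specialized terminology keeps evolving, so existing methods cannot address these problems effectively. Having built a corpus of ancient TCM books and analyzed the entity names that occur in them, this paper therefore proposes an entity recognition method for ancient TCM books that combines semi-supervised learning with rules. With the conditional random field model as the basic framework, the method introduces supervised features such as word, part-of-speech, and dictionary features together with unsupervised semantic features obtained from word vectors. The recognition performance of different feature combinations is compared to determine the optimal semi-supervised model, which is then compared with other models. Next, a rule base built on the linguistic characteristics of ancient books is used for rule-based post-processing. In the experiments the final F-score reaches 83.18%, which demonstrates the effectiveness of the method.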
Abstract
Named entity recognition in ancient traditional Chinese medicine (TCM) books has received little attention. Because annotating such specialized text in classical Chinese is difficult and costly, this paper proposes a method for identifying TCM entities that combines semi-supervised learning with rules. Within the framework of the conditional random fields model, supervised features such as lexical and dictionary features are introduced together with unsupervised semantic features derived from word vectors. The optimal semi-supervised learning model is selected by comparing the performance of different feature combinations. Finally, the recognition results of the model are analyzed, and rule-based post-processing is built on the linguistic characteristics of ancient books. Experimental results show an F-score of 83.18%, which supports the validity of the method.
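The feature design summarized above can be illustrated with a minimal sketch. It assumes that tokens are already segmented and POS-tagged, and that word vectors are turned into symbolic features by a simple fixed-threshold discretization; the threshold, the feature names, and the helper functions `discretize` and `token_features` are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import Dict, List, Mapping, Sequence, Set


def discretize(vector: Sequence[float], threshold: float = 0.1) -> List[str]:
    """Map each embedding dimension to a symbol: P (positive), N (negative), Z (near zero).

    The fixed threshold is an assumption made for this sketch.
    """
    symbols = []
    for value in vector:
        if value > threshold:
            symbols.append("P")
        elif value < -threshold:
            symbols.append("N")
        else:
            symbols.append("Z")
    return symbols


def token_features(
    tokens: Sequence[str],
    pos_tags: Sequence[str],
    i: int,
    lexicon: Set[str],
    embeddings: Mapping[str, Sequence[float]],
) -> Dict[str, str]:
    """Build one feature dictionary for token i of a sentence."""
    word = tokens[i]
    # Supervised features: the word itself, its POS tag, and a dictionary match flag.
    feats = {"word": word, "pos": pos_tags[i], "in_dict": str(word in lexicon)}
    # Context features from the neighbouring tokens, with boundary markers.
    feats["prev_word"] = tokens[i - 1] if i > 0 else "<BOS>"
    feats["next_word"] = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    # Unsupervised features: discretized word-vector dimensions, when a vector exists.
    vector = embeddings.get(word)
    if vector is not None:
        for dim, symbol in enumerate(discretize(vector)):
            feats[f"emb_{dim}"] = symbol
    return feats


# Toy inputs, invented for illustration only.
tokens = ["桂枝", "主", "头痛"]
pos_tags = ["n", "v", "n"]
lexicon = {"桂枝", "头痛"}
embeddings = {"桂枝": [0.8, -0.3, 0.0]}

sentence_features = [
    token_features(tokens, pos_tags, i, lexicon, embeddings)
    for i in range(len(tokens))
]
```

Per-token dictionaries of this shape are a common input format for linear-chain CRF toolkits; dropping or adding feature groups in `token_features` is one way to compare feature combinations as the abstract describes.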
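Keywords
semi-supervised learning /
conditional random fields /
named entity recognition /
ancient traditional Chinese medicine books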
Key words
semi-supervised /
conditional random fields /
named entity recognition /
traditional Chinese medicine books
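Funding
General Project of the Science and Technology Program of the Beijing Municipal Education Commission (KM202110025021); Cui Xizhang TCM Culture Inheritance Studio of the Beijing TCM "Xinhuo Inheritance 3+3 Project"; Capital Medical University Research Cultivation Fund (PYZ19167)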