Abstract
This paper proposes an unsupervised method for low-resource named entity recognition in the electric power domain. We collect a target-domain corpus and use string frequency statistics to update the domain vocabulary. We also obtain a small set of entity words and their types by parsing structured electric power maintenance manuals, and select representative words for each entity type according to word-embedding similarity. We then pre-train an electric power BERT model with the whole word masking technique, predict candidate entity words in the text, and determine their entity types by computing the semantic similarity between each candidate and the representative words of each type. Experiments show that the method places few demands on data, is highly practical, and can be easily reused in other domains.
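To make the typing step concrete, the following is a minimal sketch of the whole-word-masking similarity idea described above. It assumes the public hfl/chinese-bert-wwm checkpoint loaded through the Hugging Face transformers library (the paper pre-trains its own power-domain BERT), and the entity types and representative words in type_reps are illustrative placeholders, not the paper's actual resources parsed from maintenance manuals.

```python
# Sketch: type an in-text candidate word by masking it (whole-word masking),
# taking the hidden states at the masked positions as its contextual
# representation, and comparing it with per-type representative-word centroids.
import torch
from transformers import BertModel, BertTokenizer

MODEL_NAME = "hfl/chinese-bert-wwm"  # public WWM checkpoint; a stand-in for the paper's power-domain BERT
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_word(word: str) -> torch.Tensor:
    """Encode a word on its own and mean-pool its subword hidden states."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[1:-1].mean(dim=0)  # drop [CLS] and [SEP]

def embed_masked(sentence: str, word: str) -> torch.Tensor:
    """Whole-word masking: replace every subword of `word` in the sentence
    with [MASK], then mean-pool the hidden states at the masked positions."""
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    inputs = tokenizer(sentence, return_tensors="pt")
    ids = inputs["input_ids"][0].tolist()
    for i in range(len(ids) - len(word_ids) + 1):  # first occurrence only
        if ids[i:i + len(word_ids)] == word_ids:
            span = list(range(i, i + len(word_ids)))
            break
    else:
        raise ValueError(f"{word!r} not found in sentence")
    for j in span:
        inputs["input_ids"][0, j] = tokenizer.mask_token_id
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[span].mean(dim=0)

# Hypothetical entity types and representative words; the paper derives
# these from structured maintenance manuals rather than a hand-written dict.
type_reps = {
    "设备": ["变压器", "断路器", "隔离开关"],  # equipment
    "故障": ["短路", "过载", "绝缘击穿"],      # fault
}
centroids = {t: torch.stack([embed_word(w) for w in ws]).mean(dim=0)
             for t, ws in type_reps.items()}

def classify(sentence: str, candidate: str) -> str:
    """Assign the entity type whose centroid is most similar to the
    masked-context representation of the candidate word."""
    v = embed_masked(sentence, candidate)
    sims = {t: torch.cosine_similarity(v, c, dim=0).item()
            for t, c in centroids.items()}
    return max(sims, key=sims.get)

print(classify("检修期间发现主变压器存在过载现象。", "变压器"))  # expect 设备
```

Mean-pooling the masked positions is one simple way to obtain a context-dependent word vector; the paper's actual scoring over the words predicted at the masked slots may differ in detail.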
Key words: named entity recognition / unsupervised method / electric power domain / BERT whole word masking
Funding
Science and Technology Project of State Grid Shandong Electric Power Company (2020A-013)