Abstract
Named entity recognition (NER) is a foundational task for the intelligent analysis of literary works, yet NER research in the literary domain remains underdeveloped, largely because annotated corpora are scarce. Starting from Jin Yong's novels, this paper annotates named entities in two novels totaling more than 1.8 million characters, covering four entity types and more than 50,000 entities. To address the characteristics of novel text, the paper proposes a named entity recognition model that incorporates document-level information: a document-level dictionary records the historical states of Chinese characters, and confidence estimation is used to fuse a BiGRU-CRF model with a Transformer model. Experimental results show that exploiting document-level information effectively improves named entity recognition performance. Finally, the paper discusses the application of named entity recognition to the construction of social networks in novels.
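The abstract names two mechanisms, a document-level dictionary that keeps the historical state of each character and a confidence-based fusion of two taggers, but does not give their exact formulation. The following is a minimal sketch of one plausible reading, not the paper's implementation. It assumes that the "historical state" is a running mean of per-character state vectors and that "confidence" is the highest label probability each model assigns to a character. The names `DocumentMemory` and `fuse_by_confidence` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


class DocumentMemory:
    """Hypothetical document-level dictionary.

    Maps each character to the running mean of the state vectors
    observed for it earlier in the same document (an assumption).
    """

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._sum: Dict[str, List[float]] = defaultdict(lambda: [0.0] * dim)
        self._count: Dict[str, int] = defaultdict(int)

    def update(self, char: str, state: Sequence[float]) -> None:
        """Add one observed state vector for `char`."""
        total = self._sum[char]
        for i, value in enumerate(state):
            total[i] += value
        self._count[char] += 1

    def lookup(self, char: str) -> List[float]:
        """Return the mean historical state, or zeros for an unseen character."""
        n = self._count.get(char, 0)
        if n == 0:
            return [0.0] * self.dim
        return [v / n for v in self._sum[char]]


def fuse_by_confidence(
    probs_a: Sequence[Dict[str, float]],
    probs_b: Sequence[Dict[str, float]],
) -> List[str]:
    """Per character, keep the label of whichever model is more confident.

    Each input is a list of label -> probability dicts, one per character.
    Confidence is taken to be the top label probability (an assumption).
    Ties go to the first model.
    """
    fused = []
    for dist_a, dist_b in zip(probs_a, probs_b):
        label_a = max(dist_a, key=dist_a.get)
        label_b = max(dist_b, key=dist_b.get)
        fused.append(label_a if dist_a[label_a] >= dist_b[label_b] else label_b)
    return fused


if __name__ == "__main__":
    memory = DocumentMemory(dim=2)
    memory.update("郭", [1.0, 0.0])
    memory.update("郭", [0.0, 1.0])
    print(memory.lookup("郭"))

    # Toy label distributions standing in for BiGRU-CRF and Transformer outputs.
    model_a = [{"B-PER": 0.9, "O": 0.1}, {"I-PER": 0.4, "O": 0.6}]
    model_b = [{"B-PER": 0.6, "O": 0.4}, {"I-PER": 0.8, "O": 0.2}]
    print(fuse_by_confidence(model_a, model_b))
```

In this sketch the memory would be reset at the start of each novel or chapter, so that a character's history reflects only the current document. Choosing the more confident model per character is only one possible fusion rule; a weighted combination of the two distributions would be an equally reasonable assumption.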
Key words
literary text /
named entity recognition /
document-level information
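Funding
National Key Research and Development Program of China (2017YFB1002101); National Social Science Fund of China (18ZDA295, 17ZDA318); National Natural Science Foundation of China (62006211); China Postdoctoral Science Foundation (2019TQ0286, 2020M682349)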