Abstract
Organizing and using genealogy resources efficiently requires extracting entities and their relationships from unstructured genealogy text and representing them in structured form. Entity and relation extraction is usually cast as a sequence tagging task, in which each input sentence is mapped to a sequence of tags. Because genealogy text is highly dense in entities and relations, and overlapping relations are common, this paper builds a conceptual model to guide the whole extraction process. For the sequence tagging step, commonly used deep learning models are evaluated and compared on a real dataset. Experimental results show that the BERT-BiLSTM-CRF model outperforms the other compared models in precision, recall, and F1 score, and that the proposed method effectively extracts entities and relations from genealogy text.
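As a rough illustration of how precision, recall, and F1 are typically computed for a sequence tagging model, the sketch below scores predicted tags against gold tags at the entity level. The BIO tag scheme, the `PER` label, and the function names are assumptions made for this example; the abstract does not specify the paper's actual tag set or evaluation script.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" acts as a sentinel so the last open span is flushed.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            # "O", or an "I-" tag with no matching open span, opens nothing.
            start, label = None, None
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level scores: a predicted span counts only if type and boundaries match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one of two gold entities is recovered exactly.
gold = ["B-PER", "I-PER", "O", "B-PER", "O"]
pred = ["B-PER", "I-PER", "O", "O", "B-PER"]
print(precision_recall_f1(gold, pred))
```

Exact-match scoring of this kind is strict: a span with a shifted boundary counts as both a false positive and a false negative, which matters for dense text where entities sit next to each other.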
Key words
genealogy /
named entity recognition /
relation extraction /
deep learning
Funding
Fundamental Research Funds for the Central Universities and Research Funds of Renmin University of China (19XNA009)