Abstract
Attribute recognition aims to obtain the attribute values of entities from unstructured text. Extracting person attributes from text usually requires a large amount of annotated data, but such data is scarce. To address this, we build person attribute tables from the infobox tables of encyclopedia web pages, and then apply distant supervision to obtain a large-scale, multi-category annotated corpus of person attributes, which avoids the tedious process of manual annotation. On this newly built dataset, we construct two baseline models for person attribute recognition, one based on conditional random fields (CRF) and one based on bidirectional long short-term memory with a CRF layer (BiLSTM-CRF). Experimental results show that BiLSTM-CRF outperforms CRF, reaching an average F1 score of 83.39%.
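The distant supervision step summarized in the abstract can be illustrated with a minimal sketch. It assumes that a person attribute table is a plain mapping from attribute name to value, and that a sentence is labeled by exact string matching of those values with character-level BIO tags. The function name `distant_label`, the attribute names, and the longest-match-first heuristic are hypothetical choices made for illustration; the abstract does not describe the matching rules at this level of detail.

```python
from typing import Dict, List


def distant_label(sentence: str, attributes: Dict[str, str]) -> List[str]:
    """Tag each character of `sentence` with a BIO label by matching attribute values.

    `attributes` is a hypothetical person attribute table, e.g. one row
    harvested from an encyclopedia infobox.
    """
    tags = ["O"] * len(sentence)
    # Assumption: try longer values first so that a short value cannot
    # claim part of a longer value that contains it.
    ordered = sorted(attributes.items(), key=lambda kv: -len(kv[1]))
    for attr, value in ordered:
        if not value:
            continue  # skip empty infobox cells
        start = sentence.find(value)
        while start != -1:
            end = start + len(value)
            # Label the span only if no earlier match already covers it.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = f"B-{attr}"
                for i in range(start + 1, end):
                    tags[i] = f"I-{attr}"
            start = sentence.find(value, end)
    return tags


if __name__ == "__main__":
    table = {"birth_place": "Ulm", "birth_date": "1879"}
    text = "Einstein was born in Ulm in 1879."
    for char, tag in zip(text, distant_label(text, table)):
        print(repr(char), tag)
```

Sequences labeled this way have the input format that a sequence tagger such as a CRF or BiLSTM-CRF consumes. Exact matching is noisy: a value may appear in a sentence that does not actually express the attribute, so a real pipeline would likely need additional filtering.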
Keywords
attribute recognition / annotated data / distant supervision
Funding
National Natural Science Foundation of China (61525205, 61876115)