Classifying Named Entities on Chinese Wikipedia

XU Zhihao, HUI Haotian, QIAN Longhua, ZHU Qiaoming

Journal of Chinese Information Processing, 2015, Vol. 29, Issue (5): 91-98.
Information Extraction and Text Mining

Abstract

Classifying Wikipedia entities is important for natural language processing and machine learning. This paper presents a machine-learning method for classifying Chinese Wikipedia articles by entity type. In addition to basic features drawn from the semi-structured data and unstructured text of Wikipedia pages, we use extended features and semantic features tailored to the characteristics of Chinese to improve classification performance. Experiments on a manually annotated corpus show that these additional features effectively improve entity classification, with an overall F1 of 96% on the ACE entity type hierarchy and 95% on the extended entity type hierarchy.
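The abstract reports an overall F1 without stating how it is aggregated. As a minimal sketch, assuming micro-averaging over entity types and a hypothetical non-entity label "NON" that is excluded from scoring, per-type and overall scores could be computed as follows. The function name, the "NON" label, and the toy gold/predicted labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from collections import Counter


def f1_report(gold, pred, ignore=("NON",)):
    """Per-type precision/recall/F1 plus a micro-averaged overall score.

    Labels listed in `ignore` (a hypothetical non-entity class) are not
    scored as a type of their own.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g not in ignore:
                tp[g] += 1
        else:
            if p not in ignore:
                fp[p] += 1  # predicted type p was wrong
            if g not in ignore:
                fn[g] += 1  # gold type g was missed

    def prf(t, f_pos, f_neg):
        precision = t / (t + f_pos) if t + f_pos else 0.0
        recall = t / (t + f_neg) if t + f_neg else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    types = sorted(set(tp) | set(fp) | set(fn))
    report = {t: prf(tp[t], fp[t], fn[t]) for t in types}
    # Micro-average: pool the counts of all types before computing P/R/F1.
    report["overall"] = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return report


# Toy labels using ACE-style type names; not data from the paper.
gold = ["PER", "ORG", "GPE", "PER", "NON", "LOC"]
pred = ["PER", "ORG", "LOC", "PER", "PER", "LOC"]
report = f1_report(gold, pred)
for name, (p, r, f) in report.items():
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```

If the paper's overall figure is instead a macro-average, the "overall" entry would be the mean of the per-type F1 values rather than the pooled-count score used here.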

Key words

Wikipedia / named entity classification / semi-structured data / Infobox

Cite this article

XU Zhihao, HUI Haotian, QIAN Longhua, ZHU Qiaoming. Classifying Named Entities on Chinese Wikipedia. Journal of Chinese Information Processing. 2015, 29(5): 91-98

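Funding

National Natural Science Foundation of China (61373096, 90920004); Major Project of Natural Science Research of Jiangsu Higher Education Institutions (11KJA520003)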