Abstract
Classifying Wikipedia entities is important for natural language processing and machine learning. This paper presents a machine-learning method for classifying the entities described by Chinese Wikipedia articles. In addition to basic features drawn from the semi-structured data and unstructured text of Wikipedia pages, we use extended features tailored to the characteristics of Chinese, together with semantic features, to improve classification performance. Experiments on a manually annotated corpus show that these additional features effectively improve entity classification, with an overall F1-measure of 96% on the ACE entity type hierarchy and 95% on the extended entity type hierarchy.
Key words
Wikipedia /
named entity classification /
semi-structured data /
Infobox
Funding
National Natural Science Foundation of China (61373096, 90920004); Major Project of Natural Science Research of Jiangsu Higher Education Institutions (11KJA520003)