Abstract
Entity mention detection (EMD) is the task of recognizing all mentions of entities in a document, including named, nominal (common-noun), and pronominal mentions. This paper proposes a Chinese entity mention detection method based on multi-level feature integration. It exploits the ability of conditional random fields (CRFs) to combine heterogeneous features, drawing on characters, pinyin (phonetic symbols), words and their part-of-speech tags, lists of proper names of various kinds, and frequency statistics to improve recognition performance. The EMD subtasks are organized into a three-stage pipeline in which three different CRF classifiers label the attributes of each mention sequentially, in a predefined order. A system built on this method was submitted to the Chinese EMD track of the NIST Automatic Content Extraction 2007 (ACE07) evaluation, where its ACE Value ranked second.
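As a rough illustration of the two ideas in the abstract, the sketch below builds a per-character feature dictionary that mixes several feature levels, and runs a fixed-order pipeline in which each labeler can see the output of the earlier ones. Every name in it (`multi_level_features`, `run_pipeline`, the feature keys, the toy sentence and resources) is a hypothetical choice made for this example. The paper's actual feature templates, label sets, and CRF toolkit are not given in the abstract, and the stub labelers only stand in for trained CRF classifiers.

```python
from collections import Counter


def multi_level_features(chars, i, pinyin, word_pos, name_list, char_freq):
    """Build a feature dict for the character at position i.

    Feature names and window sizes are illustrative assumptions,
    not the templates used in the paper.
    """
    ch = chars[i]
    word, pos, offset = word_pos[i]
    # True if any span of up to 4 characters on each side that covers
    # position i appears in the proper-name list.
    in_list = any(
        "".join(chars[s:e]) in name_list
        for s in range(max(0, i - 3), i + 1)
        for e in range(i + 1, min(len(chars), i + 4) + 1)
    )
    return {
        "char": ch,                                   # character level
        "prev_char": chars[i - 1] if i > 0 else "<BOS>",
        "next_char": chars[i + 1] if i + 1 < len(chars) else "<EOS>",
        "pinyin": pinyin.get(ch, "<UNK>"),            # phonetic level
        "word": word,                                 # word level
        "pos": pos,                                   # part of speech
        "pos_in_word": offset,
        "in_name_list": in_list,                      # proper-name list
        "freq_bucket": min(char_freq[ch], 3),         # frequency statistics
    }


def run_pipeline(chars, stages):
    """Apply labelers in a fixed order; each sees all earlier label layers."""
    layers = []
    for stage in stages:
        layers.append(stage(chars, layers))
    return layers


# Toy inputs (assumed): a segmented, POS-tagged sentence and small resources.
chars = list("北京大学的学生")
pinyin = {"北": "bei", "京": "jing", "大": "da", "学": "xue",
          "的": "de", "生": "sheng"}
word_pos = [("北京大学", "nt", 0), ("北京大学", "nt", 1),
            ("北京大学", "nt", 2), ("北京大学", "nt", 3),
            ("的", "u", 0), ("学生", "n", 0), ("学生", "n", 1)]
name_list = {"北京大学"}
char_freq = Counter(chars)

features = [
    multi_level_features(chars, i, pinyin, word_pos, name_list, char_freq)
    for i in range(len(chars))
]


# Stub labelers standing in for three trained CRF classifiers.
def stage_one(chars, layers):
    return ["B" if i == 0 else "I" for i in range(len(chars))]


def stage_two(chars, layers):
    return [tag + "-x" for tag in layers[0]]   # depends on stage one


def stage_three(chars, layers):
    return [tag + "-y" for tag in layers[1]]   # depends on stage two


label_layers = run_pipeline(chars, [stage_one, stage_two, stage_three])
```

In a real system each feature dictionary would be passed to a CRF toolkit, and each stub would be replaced by a trained model whose features include the labels produced by the preceding stages.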
Keywords
computer application /
Chinese information processing /
entity mention detection /
multi-task labeling /
conditional random fields /
ACE evaluation
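Funding
National Natural Science Foundation of China (60473140); National 863 High-Tech Program of China (2006AA01Z154); Program for New Century Excellent Talents in University, Ministry of Education of China (NCET-05-0287); National 985 Project (985-2-DB-C03)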