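Entity recognition plays a vital role in many natural language processing applications. Most current research concentrates on named entity recognition and ignores nesting between entities. In the context of the Automatic Content Extraction (ACE) evaluation, this paper studies multi-level nested recognition of all types of entity mentions (named, nominal, and pronominal) in Chinese text. We divide nested entity recognition into two subtasks: nested entity boundary detection and multi-level entity information labeling. First, we propose a method for encoding hierarchical structure information that converts multi-level nested boundary detection into a conventional sequence labeling problem, and we use a conditional random field (CRF) model that combines multiple features to make statistical decisions. Second, we treat multi-level information labeling as a classification problem and, from an implementation standpoint, design a parallel SVM classifier with two classification engines. This avoids building a separate classifier for each level of information and clearly outperforms a single classifier. Experiments on the standard ACE corpus show that the CRF-based multi-level entity boundary detection model reaches 71% accuracy, and that the two parallel classification engines, combined with a feature selection strategy, reach 89.05% and 82.17% accuracy respectively.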
Abstract
Entity recognition plays an important role in many natural language processing applications. Previous work on entity recognition has focused mainly on named entity recognition (NER) and has not considered nested entities. This paper proposes a multi-scale nested entity mention recognition system in the context of Automatic Content Extraction (ACE), which aims to identify named, nominal, and pronominal mentions of entities in unstructured text and to assign multiple attributes to each mention. We separate this task into two subtasks: multi-scale nested boundary detection and multiple information recognition. First, we propose an information encoding method for nested structure that recasts the multi-scale nested boundary detection problem as a classical sequence labeling problem. Second, we present a parallel two-agent classifier that performs multiple information recognition for each entity mention. Furthermore, a rich set of multi-level linguistic features is integrated into our machine-learning-based framework to achieve competitive performance. We evaluate the proposed framework on the standard ACE corpus and obtain an accuracy of 71% for nested boundary detection and accuracies of 89.05% and 82.17% for the two classification agents, respectively.
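The abstract does not spell out the encoding itself, so the following is only a minimal sketch of one way nested boundaries can be flattened into a per-token sequence labeling problem. It assumes that spans are properly nested (no crossing boundaries), that each span is placed on a layer equal to the number of spans enclosing it, and that the per-layer BIO tags are joined into one composite label. The names `encode_nested`, `decode_nested`, and `max_depth` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def encode_nested(n_tokens: int, spans: List[Span], max_depth: int = 3) -> List[str]:
    """Collapse properly nested spans into one composite BIO label per token.

    Assumption: spans never cross; they are either disjoint or nested.
    """
    layers = [["O"] * n_tokens for _ in range(max_depth)]
    unique = sorted(set(spans))
    for start, end in unique:
        # Depth = number of other spans that enclose this one (0 = outermost).
        depth = sum(
            1 for s, e in unique
            if (s, e) != (start, end) and s <= start and end <= e
        )
        if depth >= max_depth:
            raise ValueError("nesting deeper than max_depth")
        layers[depth][start] = "B"
        for i in range(start + 1, end):
            layers[depth][i] = "I"
    # One label per token, e.g. "B|I|O" = begins layer 0, inside layer 1, outside layer 2.
    return ["|".join(layer[i] for layer in layers) for i in range(n_tokens)]


def decode_nested(labels: List[str]) -> List[Span]:
    """Invert encode_nested by reading each layer's BIO tags back into spans."""
    spans: List[Span] = []
    if not labels:
        return spans
    columns = [label.split("|") for label in labels]
    for d in range(len(columns[0])):
        start = None
        for i, col in enumerate(columns):
            tag = col[d]
            if tag != "I" and start is not None:
                spans.append((start, i))  # the open span ends before token i
                start = None
            if tag == "B":
                start = i
        if start is not None:
            spans.append((start, len(labels)))
    return sorted(spans)
```

For the four tokens 北京 / 大学 / 的 / 校长 with mentions covering tokens 0-3, 0-1, and 0, `encode_nested(4, [(0, 4), (0, 2), (0, 1)])` yields `["B|B|B", "I|I|O", "I|O|O", "I|O|O"]`, and `decode_nested` recovers the three spans. A sequence labeler such as a CRF could then be trained on these composite labels, at the cost of a label set that grows with the nesting depth.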
Key words
artificial intelligence /
natural language processing /
nested entity mention recognition /
conditional random fields /
support vector machine
Funding
Supported by the National Natural Science Foundation of China (60372016) and the Beijing Natural Science Foundation (4052027)