Abstract
Supervised character sequence labeling models are widely used for Chinese named entity recognition (NER). In practical applications, we find that word boundary errors are one of the main factors limiting the performance of such models, accounting on average for 47.5% of all erroneous results. This paper introduces global word boundary features into an averaged perceptron model, which raises the F-score for person, location, and organization name recognition by 0.04 on average and reduces the proportion of boundary errors among all errors.
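The share of boundary errors among all errors can be made concrete with a small sketch. It is an illustration only: the abstract does not define how an error is counted, so the code assumes that an error is any predicted entity absent from the gold annotation, and that a boundary error is such a prediction that overlaps a gold entity of the same type with different start or end offsets. The names `Span` and `boundary_error_ratio` are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span encoding: (start, end, entity_type), character offsets, end exclusive.
Span = Tuple[int, int, str]


def boundary_error_ratio(gold: Set[Span], pred: Set[Span]) -> float:
    """Fraction of wrong predictions that are boundary errors.

    Assumption (not from the paper): a wrong prediction is a boundary error
    when it overlaps a gold entity of the same type but its offsets differ.
    """
    wrong = [p for p in pred if p not in gold]
    if not wrong:
        return 0.0

    def is_boundary_error(p: Span) -> bool:
        p_start, p_end, p_type = p
        # Two half-open intervals overlap when each starts before the other ends.
        return any(
            g_type == p_type and g_start < p_end and p_start < g_end
            for g_start, g_end, g_type in gold
        )

    boundary = sum(1 for p in wrong if is_boundary_error(p))
    return boundary / len(wrong)


gold_spans = {(0, 3, "PER"), (5, 9, "ORG")}
pred_spans = {(0, 2, "PER"), (5, 9, "ORG"), (12, 14, "LOC")}
ratio = boundary_error_ratio(gold_spans, pred_spans)
```

In this toy case two predictions are wrong: the truncated person name is a boundary error and the spurious location is not, so half of the errors are boundary errors. Counting missed gold entities as errors too would change the denominator; that choice is left open here.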
Key words
named entity recognition /
character sequence labeling /
global feature /
word boundary feature
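Funding
National Natural Science Foundation of China (61232010, 61100083); National 973 Program (2012CB316303); National 863 Program (2012AA011003); National Science and Technology Support Program (2012BAH46B04); National Security Special Project (2013A140)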