Abstract
Statistical methods for Chinese word segmentation generally adapt poorly to new domains because their training corpora are domain-specific. In practice, domain dictionaries are much easier to obtain than manually annotated segmentation corpora, and they supply rich domain information. We propose an approach that integrates dictionary information as features into a statistical segmentation model (a CRF model in this paper) to achieve domain adaptation for Chinese word segmentation. Experimental results show that this approach significantly improves the domain adaptability of statistical Chinese word segmentation. When the test corpus is in the same domain as the training corpus, the F-measure increases by 2%; when the test corpus is in a different domain from the training corpus, the F-measure increases by 6%.
Key words: Chinese word segmentation; CRF; domain adaptation
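The abstract does not list the feature templates used, so the following is only a minimal sketch of how dictionary information might be turned into per-character features for a character-tagging CRF. The function name `dict_features`, the feature names `dict_begin` and `dict_end`, and the `max_len` cutoff are all hypothetical choices made for illustration, not the paper's definitions.

```python
def dict_features(sentence, lexicon, max_len=4):
    """Build per-character dictionary features (illustrative assumption).

    For each character, record the length of the longest lexicon word
    that begins at that character (dict_begin) and the length of the
    longest lexicon word that ends at it (dict_end). 0 means no match.
    """
    n = len(sentence)
    begin = [0] * n
    end = [0] * n
    for i in range(n):
        for length in range(1, max_len + 1):
            j = i + length
            if j > n:
                break
            if sentence[i:j] in lexicon:
                begin[i] = max(begin[i], length)
                end[j - 1] = max(end[j - 1], length)
    return [
        {"char": ch, "dict_begin": begin[i], "dict_end": end[i]}
        for i, ch in enumerate(sentence)
    ]


# Hypothetical domain lexicon; swapping it changes the features
# without retraining data having to come from the new domain.
domain_lexicon = {"条件", "随机场", "条件随机场", "模型"}
features = dict_features("条件随机场模型", domain_lexicon, max_len=5)
```

Each feature dictionary could then be passed to a CRF toolkit alongside the usual character n-gram features. Because the features encode only match lengths and not the words themselves, a model trained this way could in principle use a different dictionary at test time, which is one plausible reading of how a dictionary aids domain adaptation.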
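Funding
Key Program of the National Natural Science Foundation of China (61133012); National Natural Science Foundation of China (60803093); National 863 Major Project (2011AA01A207); Major National Science and Technology Project on Core Electronic Devices, High-end Generic Chips and Basic Software (2011ZX01042-001-001); Harbin Institute of Technology Research Innovation Fund (HIT.NSRIF.2009069); Fundamental Research Funds for the Central Universities (HIT.KLOF.2010064)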