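Chinese word segmentation is a key foundational technology in Chinese information processing. As applications of Chinese information processing develop, demand for word segmentation in specialized domains keeps growing. However, most annotated corpora available for training come from the general (or news) domain, so cross-domain porting has become a difficulty for statistical Chinese word segmentation systems. In cross-domain segmentation, the word-formation rules and feature distributions of the text to be segmented differ considerably from those of the training text, which makes it hard for fully supervised statistical learning to perform well. This paper introduces the minimum entropy regularization framework into the fully supervised CRF and proposes a semi-supervised CRF segmentation model that combines supervised training on annotated general-domain text with unsupervised training on unlabeled target-domain text. In addition, to exploit the strengths of the different segmentation approaches, the paper combines three techniques to improve the domain adaptability of the segmentation system: adding a dictionary, adding annotated corpus data, and the semi-supervised CRF model. Experiments show that, compared with the fully supervised CRF, the semi-supervised CRF improves OOV recall by 3.2 percentage points and F-value by 1.1 percentage points. Compared with only adding annotated corpus data to the CRF model, the system that mixes the methods improves OOV recall by 2.9 percentage points and F-value by 2.5 percentage points.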
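The abstract does not state the training objective explicitly. As a rough sketch, an entropy-regularized CRF objective in the general style of the entropy-minimization literature can be written as below. All symbols are assumptions introduced here for illustration and are not taken from the paper: $N$ labeled general-domain sentences $(x^{(i)}, y^{(i)})$, $M$ unlabeled target-domain sentences $\tilde{x}^{(j)}$, a Gaussian prior with variance $\sigma^2$, and a trade-off weight $\gamma \ge 0$.

```latex
% Hedged sketch: maximize labeled log-likelihood while penalizing
% the conditional entropy of predictions on unlabeled target-domain text.
\begin{align}
\mathcal{L}(\theta)
  &= \sum_{i=1}^{N} \log p_\theta\bigl(y^{(i)} \mid x^{(i)}\bigr)
   - \frac{\lVert \theta \rVert^2}{2\sigma^2}
   - \gamma \sum_{j=1}^{M} H\bigl(p_\theta(\cdot \mid \tilde{x}^{(j)})\bigr), \\
H\bigl(p_\theta(\cdot \mid \tilde{x})\bigr)
  &= - \sum_{y} p_\theta(y \mid \tilde{x}) \log p_\theta(y \mid \tilde{x}).
\end{align}
```

With $\gamma = 0$ this reduces to the usual regularized supervised CRF. A larger $\gamma$ pushes the model toward confident label sequences on the target-domain text, which is the intuition behind combining supervised and unsupervised training.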
Abstract
By applying the minimum entropy regularization framework to the supervised CRF model, this paper proposes a semi-supervised CRF model that combines supervised learning on labeled general-domain text with unsupervised learning on unlabeled text from the target professional domain. Domain adaptation is further improved by introducing a domain dictionary and a tagged corpus. Experiments on a cross-domain segmentation task show that the proposed method outperforms the supervised CRF in terms of OOV recall and F-value.
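For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how F-value and OOV recall are commonly computed for word segmentation. It is not the authors' evaluation script. The function names `to_spans` and `evaluate`, the toy sentence, and the toy vocabulary are all hypothetical. A word counts as correct only if both of its boundaries match the gold segmentation, and a gold word is OOV if it does not appear in the training vocabulary.

```python
def to_spans(words):
    """Convert a list of words into (start, end) character offsets."""
    spans = []
    start = 0
    for w in words:
        spans.append((start, start + len(w)))
        start += len(w)
    return spans


def evaluate(gold, pred, train_vocab):
    """Return (precision, recall, f_value, oov_recall) for one sentence.

    gold, pred: lists of words covering the same character sequence.
    train_vocab: set of words seen in the (assumed) training corpus.
    """
    gold_spans = to_spans(gold)
    pred_spans = set(to_spans(pred))

    # A gold word is recovered only if the identical span was predicted.
    correct = sum(1 for s in gold_spans if s in pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0

    # OOV recall: share of out-of-vocabulary gold words that were recovered.
    oov_spans = [s for w, s in zip(gold, gold_spans) if w not in train_vocab]
    oov_hits = sum(1 for s in oov_spans if s in pred_spans)
    oov_recall = oov_hits / len(oov_spans) if oov_spans else 0.0

    return precision, recall, f_value, oov_recall


# Toy example (hypothetical data, not from the paper).
gold = ["半监督", "条件", "随机场", "模型"]
pred = ["半", "监督", "条件", "随机场", "模型"]
train_vocab = {"条件", "模型", "监督"}
precision, recall, f_value, oov_recall = evaluate(gold, pred, train_vocab)
```

In this toy case two gold words are OOV and only one of them is segmented correctly, so OOV recall is lower than overall recall. That gap is what cross-domain adaptation aims to reduce.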
Key words
cross domain /
Chinese word segmentation /
semi-supervised conditional random field (semi-supervised CRF)
Funding
Beijing Philosophy and Social Science Planning Research Base Project (13JDZHB005); Fundamental Research Funds for the Central Universities (09YB09)