Abstract
This paper proposes a method for Chinese text chunking that combines a distributed strategy based on Conditional Random Fields (CRFs) with an error-driven technique. First, the eleven types of Chinese chunks are divided into groups, and a separate CRFs model is built to recognize the chunks of each group. Then, a CRFs-based error-driven technique is applied automatically to the grouped chunking results for a second round of recognition. Finally, type conflicts are resolved by processing the groups in order of their F-measure values. Experimental results show that the approach is effective: in the open test the system achieves a precision of 94.90%, a recall of 91.00%, and an F-measure of 92.91%, outperforming the CRFs-only approach, the distributed strategy alone, and other hybrid approaches.
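The final conflict-handling step can be illustrated with a minimal sketch. The abstract does not give the exact rule, so the code below assumes that when chunk spans proposed by different group models overlap, the span from the group with the higher F-measure is kept. The names `f_measure`, `resolve_conflicts`, the group names `A` and `B`, and the chunk labels are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# A chunk span: (start token index, end token index (exclusive), chunk type).
Span = Tuple[int, int, str]


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def resolve_conflicts(group_spans: Dict[str, List[Span]],
                      group_f: Dict[str, float]) -> List[Span]:
    """Merge per-group spans; groups with a higher F-measure win overlaps.

    Assumption (not stated in the abstract): a span is dropped whenever it
    overlaps a span already accepted from a higher-ranked group.
    """
    accepted: List[Span] = []
    # Visit groups from the highest to the lowest F-measure.
    for group in sorted(group_spans, key=lambda g: group_f[g], reverse=True):
        for start, end, label in group_spans[group]:
            # Keep the span only if it overlaps no accepted span.
            if all(end <= s or start >= e for s, e, _ in accepted):
                accepted.append((start, end, label))
    return sorted(accepted)


# Hypothetical output of two group models over one sentence.
spans = {
    "A": [(0, 2, "NP"), (4, 6, "NP")],
    "B": [(1, 3, "VP"), (3, 4, "VP")],
}
scores = {"A": 0.95, "B": 0.90}
merged = resolve_conflicts(spans, scores)
reported_f = round(f_measure(0.9490, 0.9100) * 100, 2)
```

Here the span `(1, 3, "VP")` from group `B` is discarded because it overlaps `(0, 2, "NP")` from the higher-scoring group `A`, while `(3, 4, "VP")` is kept. The same `f_measure` helper reproduces the reported F-measure of 92.91% from a precision of 94.90% and a recall of 91.00%.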
Key words
computer application /
Chinese information processing /
chunking /
conditional random fields (CRFs) /
distributed strategy /
CRFs-based error-driven technique /
shallow parsing
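Funding
National High Technology Research and Development Program of China (863 Program) (2006AA012140); National Natural Science Foundation of China (60673039)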