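Pipeline approaches to Chinese lexical analysis handle word segmentation, part-of-speech (POS) tagging, and named entity recognition as separate steps, which makes it hard to integrate the different kinds of information and lets errors propagate and amplify from one step to the next. To address these shortcomings, this paper proposes a trinity character-based tagging method for Chinese lexical analysis. The method treats lexical analysis as the tagging of a character sequence: the word-position, POS, and named-entity information of each character is fused into that character's tag, and a maximum entropy model performs all three tasks in a single tagging pass. A closed test was carried out on the PKU corpus of Bakeoff2007, with extensive experiments comparing the method against conventional step-by-step segmentation, POS tagging, and named entity recognition. The results show that trinity character-based tagging improves all three tasks to varying degrees: the F-score for word segmentation reaches 96.4%, the tagging accuracy for POS tagging reaches 95.3%, and the F-score for named entity recognition reaches 90.3%, indicating that trinity character-based tagging yields better Chinese lexical analysis performance.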
Abstract
To integrate multiple kinds of information and avoid the error accumulation of the pipeline approach, a unified character-based tagging approach is proposed for Chinese lexical analysis, which covers word segmentation, part-of-speech (POS) tagging, and named entity recognition. Chinese lexical analysis is treated as a character sequence tagging problem in which each character's tag combines three kinds of information: word position, part of speech, and named entity. A maximum entropy model then completes the three subtasks in a single tagging pass. A closed evaluation on the PKU corpus from Bakeoff2007 shows an F-score of 96.4% on word segmentation, an accuracy of 95.3% on POS tagging, and an F-score of 90.3% on named entity recognition.
Key words: Chinese lexical analysis; maximum entropy model; trinity; character-based tagging
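The idea of fusing word position, part of speech, and named-entity information into one tag per character can be illustrated with a minimal sketch. The abstract does not give the tag inventory or tag format, so everything concrete here is an assumption made for illustration: the four-way B/M/E/S word-position set, the `position-POS-entity` tag layout, the entity labels `PER`, `LOC` and `O`, and the helper names `word_positions`, `encode` and `decode`. The sketch covers only how tags are built and read back; it does not include the maximum entropy model that predicts them.

```python
from typing import List, Tuple

# (surface form, POS tag, entity label); labels must not contain "-".
Word = Tuple[str, str, str]


def word_positions(length: int) -> List[str]:
    """Assumed 4-tag word-position set: B(egin), M(iddle), E(nd), S(ingle)."""
    if length == 1:
        return ["S"]
    return ["B"] + ["M"] * (length - 2) + ["E"]


def encode(words: List[Word]) -> List[Tuple[str, str]]:
    """Turn analysed words into (character, combined tag) pairs."""
    tagged = []
    for surface, pos, entity in words:
        for ch, position in zip(surface, word_positions(len(surface))):
            # One tag per character carries all three kinds of information.
            tagged.append((ch, f"{position}-{pos}-{entity}"))
    return tagged


def decode(tagged: List[Tuple[str, str]]) -> List[Word]:
    """Recover words, POS tags and entity labels from a tagged character sequence."""
    words = []
    buffer = ""
    for ch, tag in tagged:
        position, pos, entity = tag.split("-")
        buffer += ch
        if position in ("E", "S"):  # a word ends at this character
            words.append((buffer, pos, entity))
            buffer = ""
    return words


# Hypothetical example sentence with illustrative POS and entity labels.
words = [("张三", "nr", "PER"), ("在", "p", "O"), ("北京", "ns", "LOC"), ("工作", "v", "O")]
for ch, tag in encode(words):
    print(ch, tag)
```

Because one tag sequence encodes all three analyses, a single pass of a sequence tagger yields segmentation, POS tags, and entity labels together, and `decode(encode(words))` returns the original word list for well-formed tag sequences.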
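Funding
National Natural Science Foundation of China (60863011); Henan Province Basic and Frontier Technology Research Program (112300410182); Key Science and Technology Research Project of the Henan Provincial Department of Education (14A520077)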