An SVMTool-Based Chinese POS Tagger

WANG Lijie, CHE Wanxiang, LIU Ting

Journal of Chinese Information Processing, 2009, Vol. 23, Issue 4: 16-22.
Review


Abstract

SVMTool is a simple, flexible, and effective generator of sequential taggers based on Support Vector Machines (SVMs) that can incorporate a large number of linguistic features. This paper applies SVMTool to Chinese part-of-speech (POS) tagging and improves accuracy by 2.07% over a baseline system based on the Hidden Markov Model. To address the low accuracy on unknown words, we add features of Chinese characters and words, including the radicals of the component characters and word reduplication, and analyze the feasibility of these two features theoretically. Experiments show that with these features the tagging accuracy on unknown words improves by 1.16% and the average error rate drops by 7.40%.
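The abstract names two kinds of features for unknown words: the radicals of a word's characters and word reduplication. The abstract does not give the paper's actual feature templates, so the following is only a minimal sketch of how such features might be computed. The tiny `RADICALS` table, the function names, and the `rad0=...` / `redup=...` string format are all assumptions made for illustration; they are not SVMTool's feature syntax.

```python
from typing import Dict, List

# Toy character-to-radical table (hypothetical; a real system would load
# a complete radical dictionary instead of these few entries).
RADICALS: Dict[str, str] = {
    "河": "氵", "湖": "氵", "海": "氵",
    "打": "扌", "拉": "扌", "推": "扌",
    "松": "木", "柏": "木",
}


def radical_features(word: str) -> List[str]:
    """Return one radical feature per character; 'UNK' if not in the table."""
    return [f"rad{i}={RADICALS.get(ch, 'UNK')}" for i, ch in enumerate(word)]


def reduplication_pattern(word: str) -> str:
    """Classify a word by a few common reduplication shapes."""
    n = len(word)
    if n == 2 and word[0] == word[1]:
        return "AA"
    if n == 3 and word[0] == word[1]:
        return "AAB"
    if n == 3 and word[1] == word[2]:
        return "ABB"
    if n == 4:
        a, b, c, d = word
        if a == b and c == d:
            return "AABB"
        if a == c and b == d:
            return "ABAB"
    return "NONE"


def word_features(word: str) -> List[str]:
    """Combine radical and reduplication features for one word."""
    return radical_features(word) + [f"redup={reduplication_pattern(word)}"]


if __name__ == "__main__":
    for w in ["河流", "看看", "高高兴兴", "研究研究"]:
        print(w, word_features(w))
```

The intuition behind such a sketch is that a radical often hints at a character's semantic class, and a reduplicated shape often correlates with particular parts of speech, so both can give a tagger some evidence about a word it never saw in training.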

Key words

computer application / Chinese information processing / part-of-speech tagging / SVMTool / unknown words / Chinese character radicals

Cite this article

WANG Lijie, CHE Wanxiang, LIU Ting. An SVMTool-Based Chinese POS Tagger. Journal of Chinese Information Processing. 2009, 23(4): 16-22


Funding

National Natural Science Foundation of China (60803093, 60675034); National 863 Program of China (2008AA01Z144)