SVM Based Chinese Text Chunking

LI Heng, ZHU Jing-bo, YAO Tian-shun

Journal of Chinese Information Processing, 2004, Vol. 18, Issue 2: 2-8.

Abstract

Classification algorithms based on support vector machines (SVMs) have attracted growing attention from researchers because of their sound theoretical foundation and good empirical results. Compared with other classification algorithms, SVMs, which are based on the structural risk minimization principle, generalize well in pattern recognition with small samples. Text chunking, a preprocessing step for parsing, divides text into syntactically related, non-overlapping groups of words (chunks) and thereby reduces the difficulty of full parsing. In this paper, we treat Chinese text chunking as a classification problem and apply SVMs to solve it. Experiments on the HIT (Harbin Institute of Technology) Chinese Treebank corpus show that the approach is effective for Chinese chunk identification, achieving an F score of 88.67%, and that it is particularly suitable when only a limited amount of labeled Chinese data is available.
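To illustrate what treating chunking as a classification problem can look like, here is a minimal Python sketch. It rests on assumptions the abstract does not confirm: chunks are encoded as one IOB2 tag per token, each token is described by a window of surrounding words and part-of-speech tags, and the F score is computed over exact chunk matches. All function names, the example sentence, and the chunk labels are hypothetical. The SVM training step itself is not shown; the feature dictionaries would be vectorized and passed to any SVM classifier.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, int, str]  # (start index, end index exclusive, chunk type)


def chunks_to_tags(n_tokens: int, chunks: List[Chunk]) -> List[str]:
    """Encode non-overlapping chunks as one IOB2 tag per token (assumed scheme)."""
    tags = ["O"] * n_tokens
    for start, end, label in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def tags_to_chunks(tags: List[str]) -> List[Chunk]:
    """Decode IOB2 tags back into (start, end, type) chunks."""
    chunks: List[Chunk] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing chunk
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            chunks.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return chunks


def window_features(words: List[str], pos: List[str], i: int,
                    width: int = 2) -> Dict[str, str]:
    """Describe token i by the words and POS tags in a +/- width window."""
    feats: Dict[str, str] = {}
    for d in range(-width, width + 1):
        j = i + d
        inside = 0 <= j < len(words)
        feats[f"w[{d}]"] = words[j] if inside else "<PAD>"
        feats[f"p[{d}]"] = pos[j] if inside else "<PAD>"
    return feats


def f_score(gold: List[Chunk], pred: List[Chunk]) -> float:
    """Harmonic mean of chunk-level precision and recall (exact match)."""
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical four-token sentence with made-up POS tags and chunk labels.
words = ["我", "喜欢", "自然", "语言"]
pos = ["r", "v", "n", "n"]
gold_chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 4, "NP")]

gold_tags = chunks_to_tags(len(words), gold_chunks)      # classification targets
features = [window_features(words, pos, i) for i in range(len(words))]

# A prediction that wrongly splits the final two-token chunk.
pred_tags = ["B-NP", "B-VP", "B-NP", "B-NP"]
score = f_score(gold_chunks, tags_to_chunks(pred_tags))
```

Each token becomes one training instance whose label is its tag, which is what lets a general-purpose classifier be applied to a segmentation task. Scoring whole chunks rather than individual tags means a single wrong tag can cost both precision and recall.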

Key words

computer application / Chinese information processing / support vector machine / structural risk minimization / text chunking

Cite this article

LI Heng, ZHU Jing-bo, YAO Tian-shun. SVM Based Chinese Text Chunking. Journal of Chinese Information Processing. 2004, 18(2): 2-8

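Funding

National Natural Science Foundation of China (60083006); National Key Basic Research Development Program of China (973 Program) (G19980305011); joint project of the National Natural Science Foundation of China and Microsoft Research Asia (60203019)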