Abstract
Classification algorithms based on support vector machines (SVMs) have attracted growing attention from researchers because of their sound theoretical foundation and good empirical results. Compared with other classification algorithms, SVMs, which follow the structural risk minimization principle, generalize well in pattern recognition with small samples. Text chunking, a preprocessing step for parsing, divides text into syntactically related, non-overlapping groups of words (chunks) and thereby reduces the difficulty of full parsing. In this paper, we treat Chinese text chunking as a classification problem and apply SVMs to solve it. The chunking experiments were carried out on the HIT (Harbin Institute of Technology) Chinese Treebank corpus. The results show that the approach is effective for Chinese chunking, achieving an F score of 88.67%, and that it is especially suitable when only a limited amount of labeled Chinese data is available.
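To make the reformulation concrete, the sketch below shows one common way to turn chunking into per-word classification and to score the result at the chunk level. It is an illustration only: the B-/I-/O tag encoding, the chunk labels `NP` and `VP`, and all function names are assumptions, since the abstract does not specify the tag set or the scoring script. No SVM is trained here; any classifier that assigns one tag per word could supply `pred_tags`.

```python
from typing import List, Tuple

# A chunk is (label, start, end) over word positions, with end exclusive.
Chunk = Tuple[str, int, int]


def chunks_to_tags(n_words: int, chunks: List[Chunk]) -> List[str]:
    """Encode non-overlapping chunks as one B-/I-/O tag per word."""
    tags = ["O"] * n_words
    for label, start, end in chunks:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def tags_to_chunks(tags: List[str]) -> List[Chunk]:
    """Decode a tag sequence back into (label, start, end) chunks."""
    chunks: List[Chunk] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing chunk
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            chunks.append((label, start, i))
            start, label = None, None
        # A stray I- tag with no open chunk is leniently treated as a start.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return chunks


def f_score(gold: List[Chunk], pred: List[Chunk]) -> float:
    """Chunk-level F score: a chunk counts only if label and span both match."""
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical five-word sentence with three gold chunks.
gold_chunks = [("NP", 0, 2), ("VP", 2, 3), ("NP", 3, 5)]
gold_tags = chunks_to_tags(5, gold_chunks)

# Hypothetical classifier output: the last chunk is wrongly split in two.
pred_tags = ["B-NP", "I-NP", "B-VP", "B-NP", "B-NP"]
pred_chunks = tags_to_chunks(pred_tags)

score = f_score(gold_chunks, pred_chunks)
```

In this toy case two of the four predicted chunks match the three gold chunks, so precision is 1/2, recall is 2/3, and the F score is 4/7 (about 0.571). A single wrong tag costs a whole chunk, which is why chunk-level F is stricter than per-word accuracy.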
Key words
computer application /
Chinese information processing /
support vector machine /
structural risk minimization /
text chunking
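Funding
Supported by the National Natural Science Foundation of China (60083006); the National Key Basic Research Development Program of China (973 Program) (G19980305011); and a joint project of the National Natural Science Foundation of China and Microsoft Research Asia (60203019).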