DAI Liu-ling, HUANG He-yan, CHEN Zhao-xiong. A Comparative Study on Feature Selection in Chinese Text Categorization[J]. Journal of Chinese Information Processing, 2004, 18(1): 27-33.
A Comparative Study on Feature Selection in Chinese Text Categorization
DAI Liu-ling1,2, HUANG He-yan2, CHEN Zhao-xiong2
1. Department of Computer Science, Nanjing University of Science and Technology (NUST) 2. Research Center of Computer and Language Information Engineering, Chinese Academy of Sciences (CAS)
Abstract: This paper presents a comparative study of feature selection methods for text categorization. Four methods were evaluated: document frequency (DF), information gain (IG), mutual information (MI), and the χ² test (CHI). A support vector machine (SVM) and a k-nearest neighbor (KNN) classifier were used as the evaluating classifiers. IG, MI, and CHI performed poorly in our tests, although they perform well in English text categorization. We analyze the reasons theoretically and put forward possible solutions. A further experiment showed that the combined feature selection method is effective.
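For readers unfamiliar with the four criteria named in the abstract, the sketch below computes each of them for a single term and a single category from a 2x2 table of document counts. It uses the common textbook definitions, which are an assumption here: the abstract does not give the exact formulas the authors used, and the function name `term_scores` and the count names `a`, `b`, `c`, `d` are illustrative only.

```python
import math


def entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(k / total * math.log(k / total) for k in counts if k > 0)


def term_scores(a, b, c, d):
    """Score one term against one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d

    # Document frequency: how many documents contain the term at all.
    df = a + b

    # Pointwise mutual information: log P(t, c) / (P(t) * P(c)).
    mi = math.log(a * n / ((a + b) * (a + c))) if a > 0 else float("-inf")

    # Chi-square statistic for the 2x2 table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    # Information gain (two-class case): H(C) - H(C | term present/absent).
    h_c = entropy([a + c, b + d])
    h_c_given_t = (a + b) / n * entropy([a, b]) + (c + d) / n * entropy([c, d])
    ig = h_c - h_c_given_t

    return {"DF": df, "IG": ig, "MI": mi, "CHI": chi}


# A frequent, fairly discriminative term versus a term seen in one document.
common = term_scores(40, 10, 10, 40)
rare = term_scores(1, 0, 49, 50)
print(common)
print(rare)
```

With these definitions the single-occurrence term receives a higher MI score than the frequent one but a much lower CHI score, which illustrates how differently the criteria treat rare terms. Whether this effect is what the authors identified is not stated in the abstract.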