ZHOU Qian, ZHAO Ming-sheng, HU Min. Study on Feature Selection in Chinese Text Categorization [J]. Journal of Chinese Information Processing, 2004, 18(3): 18-24.
Study on Feature Selection in Chinese Text Categorization
ZHOU Qian, ZHAO Ming-sheng, HU Min
Department of Electronic Engineering, Tsinghua University
Abstract: This paper introduces and compares eight feature selection methods for text categorization. Two of the eight are proposed here: Multi-Class Odds Ratio (MC-OR), a variant of the Odds Ratio measure commonly used in binary classification, and a new feature selection method based on Class-Discriminating Words (CDW). Each method is combined with the classic VSM classifier based on cosine similarity and with the Naïve Bayes classifier, and training and testing are carried out on two text sets with different class distributions. The results indicate that MC-OR and CDW achieve the best feature selection performance.
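For readers unfamiliar with the binary Odds Ratio measure that MC-OR is said to extend, the following is a minimal sketch of the standard two-class form, log[P(t|pos)(1 - P(t|neg)) / ((1 - P(t|pos))P(t|neg))]. It is not the paper's MC-OR or CDW, which the abstract does not define. The function names, the additive smoothing constant, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels, positive, smoothing=0.5):
    """Score each term by its log odds ratio for the `positive` class.

    docs: list of token lists; labels: parallel list of class names.
    The additive smoothing is an assumption of this sketch; it keeps the
    estimated probabilities strictly between 0 and 1.
    """
    pos_df, neg_df = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens, label in zip(docs, labels):
        if label == positive:
            n_pos += 1
            pos_df.update(set(tokens))  # document frequency, not term count
        else:
            n_neg += 1
            neg_df.update(set(tokens))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        p_pos = (pos_df[term] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[term] + smoothing) / (n_neg + 2 * smoothing)
        scores[term] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_k(scores, k):
    """Return the k terms with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpus (hypothetical data, two classes).
docs = [
    ["stock", "market", "rises"],
    ["market", "falls"],
    ["team", "wins", "match"],
    ["team", "loses"],
]
labels = ["finance", "finance", "sports", "sports"]

scores = odds_ratio_scores(docs, labels, positive="finance")
print(select_top_k(scores, 1))
```

Terms concentrated in the positive class receive large positive scores and terms concentrated in the other class receive negative ones, so keeping only the top-scoring terms favours features that indicate the target class.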