Abstract: In text classification, well-known feature selection methods such as information gain assume conditional independence between features. This assumption can cause serious redundancy among the selected features. To reduce redundancy within the selected feature subset, this paper proposes a feature selection method based on the minimal redundancy principle (MRP), which takes correlations between features into account during selection and thereby builds a feature subset with less redundancy. Experimental results show that the MRP method improves the effectiveness of feature selection and yields better text classification performance in most cases.
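The abstract does not give the MRP scoring formula, so the following is only an illustrative sketch of the general idea of penalizing redundancy during selection, not the paper's method. It assumes a greedy criterion in the style of max-relevance, min-redundancy selection: each candidate is scored by its mutual information with the class label minus its mean mutual information with the features already chosen. The function names, the scoring rule, and the toy data are all assumptions introduced for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def select_features(columns, labels, k):
    """Greedily pick k features (hypothetical criterion, not the paper's formula).

    columns: dict mapping feature name -> list of discrete values, one per document
    labels:  list of class labels, one per document
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in columns.items()}
    selected = []
    remaining = set(columns)

    def score(name):
        # Relevance to the label minus mean redundancy with chosen features.
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(columns[name], columns[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy data: "match" duplicates "goal", so it adds no new information.
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    columns = {
        "goal":  [1, 1, 1, 0, 0, 0, 0, 0],
        "match": [1, 1, 1, 0, 0, 0, 0, 0],
        "vote":  [0, 0, 1, 0, 1, 1, 0, 1],
    }
    print(select_features(columns, labels, 2))
```

On this toy data, ranking by relevance alone would pick the two identical features "goal" and "match", whereas the redundancy penalty causes the less relevant but complementary "vote" to be chosen second.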