Feature Selection Based on Minimal Redundancy Principle for Text Classification

ZHANG Xi-juan, WANG Hui-zhen, ZHU Jing-bo

Journal of Chinese Information Processing ›› 2007, Vol. 21 ›› Issue (5): 56-60.
Review



Abstract

In text classification, well-known feature selection methods such as information gain (IG) adopt a conditional independence assumption among features. This assumption, however, leads to serious redundancy among the selected features. To alleviate redundancy within the selected feature subset, this paper proposes a feature selection method based on the minimal redundancy principle (MRP): correlations between different features are taken into account during selection, so that a feature subset with less redundancy can be built. Experimental results show that the MRP method improves the effectiveness of feature selection and, in most cases, yields better text classification performance.
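
The abstract states the idea without implementation detail. Purely as an illustrative sketch (not the paper's actual algorithm), the minimal-redundancy idea can be realized as a greedy selector in the spirit of the mRMR criterion of Peng, Long, and Ding: at each step, pick the feature whose mutual information with the class label is highest after subtracting its average mutual information with the features already chosen. All function names and the exact scoring below are assumptions for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # sum over joint outcomes: p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def select_features(X, labels, k):
    """Greedy minimal-redundancy selection (illustrative, mRMR-style):
    at each step choose the feature maximizing
    (relevance to the label) - (average redundancy with features already selected).
    X is a list of sample rows; returns the chosen column indices in order."""
    cols = list(map(list, zip(*X)))  # transpose rows into feature columns
    relevance = [mutual_information(col, labels) for col in cols]
    selected = []
    for _ in range(k):
        best, best_score = None, float('-inf')
        for j in range(len(cols)):
            if j in selected:
                continue
            # average mutual information with already-selected features
            redundancy = (sum(mutual_information(cols[j], cols[s])
                              for s in selected) / len(selected)) if selected else 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

Under this scoring, a feature that merely duplicates an already-selected one is penalized by its full shared information and loses to a less redundant candidate, even one with lower individual relevance — the behavior a pure-relevance ranker such as IG cannot produce.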


Key words

computer application / Chinese information processing / conditional independence assumption / minimal redundancy principle / feature selection / text classification

Cite this article

ZHANG Xi-juan, WANG Hui-zhen, ZHU Jing-bo. Feature Selection Based on Minimal Redundancy Principle for Text Classification. Journal of Chinese Information Processing. 2007, 21(5): 56-60


Funding

National Natural Science Foundation of China (60473140); National 863 High-Tech Program (2006AA01Z154); Program for New Century Excellent Talents in University, Ministry of Education (NCET-05-0287); National 985 Project (985-2-DB-C03)