ZHANG Yanxiang, PAN Haixia. A Feature Selection Method Based on Discriminative Ability for Multiclass Text Categorization on Imbalanced Data[J]. Journal of Chinese Information Processing, 2015, 29(4): 111-119.
A Feature Selection Method Based on Discriminative Ability for Multiclass Text Categorization on Imbalanced Data
ZHANG Yanxiang, PAN Haixia
School of Software, Beihang University, Beijing 100191, China
Abstract: Imbalanced data is pervasive in real-world text categorization. Conventional feature selection (FS) methods tend to choose features from large classes rather than rare classes. This paper proposes a quantitative method to measure this dominance of large classes. It then describes a new FS method, the DA method, which is based on category discriminative ability and takes the minimum absolute difference in document probability between classes as its criterion, partly ensuring that feature selection is fair to both large and rare classes. Experimental results show that the DA method outperforms CHI, IG and DFICF, especially on the macro-averaged F1 measure.
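The abstract states the DA criterion only in words, so the following Python sketch is one possible reading of it, not the authors' formula. It assumes that the "document probability" of a term t in class c is the fraction of documents in c that contain t, and that the score of t for class c is the smallest absolute difference between that probability and the probability in any other class. The function names `doc_probabilities` and `da_scores` are hypothetical.

```python
from collections import defaultdict


def doc_probabilities(docs, labels):
    """Return P(t | c): the fraction of documents in class c that contain term t.

    Assumption: this is what the abstract calls "document probability".
    """
    class_sizes = defaultdict(int)
    doc_freq = defaultdict(lambda: defaultdict(int))
    for tokens, label in zip(docs, labels):
        class_sizes[label] += 1
        for term in set(tokens):  # count each term once per document
            doc_freq[term][label] += 1
    return {
        term: {c: doc_freq[term].get(c, 0) / n for c, n in class_sizes.items()}
        for term in doc_freq
    }


def da_scores(docs, labels):
    """Score each (term, class) pair by the minimum absolute difference
    between P(t | c) and P(t | c') over all other classes c'.

    Assumption: an interpretation of the criterion named in the abstract.
    Requires at least two distinct classes.
    """
    probs = doc_probabilities(docs, labels)
    scores = {}
    for term, by_class in probs.items():
        scores[term] = {
            c: min(abs(p - q) for other, q in by_class.items() if other != c)
            for c, p in by_class.items()
        }
    return scores


if __name__ == "__main__":
    # Toy corpus: "sports" is the large class, the other two are rare.
    docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["vote", "match"]]
    labels = ["sports", "sports", "finance", "politics"]
    for term, by_class in sorted(da_scores(docs, labels).items()):
        print(term, by_class)
```

Because each probability is normalized by the size of its own class, a term that appears in every document of a one-document class scores as high as a term that appears in every document of a large class. This is the sense in which such a criterion can treat large and rare classes more evenly than raw counts would.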