Abstract
Imbalanced data is common in real-world text categorization. Conventional feature selection (FS) methods tend to favor features from majority classes and neglect rare classes. To address this bias, this paper proposes a dominance analysis method that quantifies it and, building on an optimization of that method, describes a new FS method based on category discriminative ability, the DA (Discriminative Ability) method. DA uses the minimum absolute difference of document probability between classes as its scoring criterion, which goes some way toward ensuring that feature selection treats rare classes and majority classes fairly. Experimental results show that DA outperforms CHI, IG and DFICF, especially on the macro-averaged F1 measure, and that it achieves better dimensionality reduction on imbalanced problems.
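The abstract does not give the DA formula, so the following is only a minimal sketch of one possible reading of "minimum absolute difference of document probability between classes". It assumes that the document probability of a term in a class is the fraction of that class's documents containing the term, and that a term's score is the smallest absolute difference of this probability over all pairs of classes. The function names `document_probabilities`, `da_scores` and `select_top_k` and the toy corpus are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def document_probabilities(docs, labels):
    """Return {class: {term: fraction of that class's documents containing the term}}.

    Assumption: "document probability" is estimated per class, so a rare class
    is normalized by its own size rather than by the size of the whole corpus.
    """
    by_class = {}
    for tokens, label in zip(docs, labels):
        by_class.setdefault(label, []).append(set(tokens))
    probs = {}
    for label, doc_sets in by_class.items():
        vocab = set().union(*doc_sets)
        probs[label] = {
            term: sum(term in d for d in doc_sets) / len(doc_sets)
            for term in vocab
        }
    return probs


def da_scores(docs, labels):
    """Score each term by the minimum absolute difference of its document
    probability over all pairs of classes (an assumed reading of the abstract)."""
    probs = document_probabilities(docs, labels)
    classes = sorted(probs)
    if len(classes) < 2:
        raise ValueError("at least two classes are required")
    vocab = set().union(*(probs[c].keys() for c in classes))
    return {
        term: min(
            abs(probs[a].get(term, 0.0) - probs[b].get(term, 0.0))
            for a, b in combinations(classes, 2)
        )
        for term in vocab
    }


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Toy imbalanced corpus: four "sports" documents and one "law" document.
docs = [
    ["ball", "game"],
    ["ball", "team"],
    ["game", "team"],
    ["ball", "game"],
    ["court", "game"],
]
labels = ["sports", "sports", "sports", "sports", "law"]

scores = da_scores(docs, labels)
print(select_top_k(scores, 2))
```

Under these assumptions, the term "court" occurs only in the single rare-class document yet receives the highest score, because each class is normalized by its own document count. This is the kind of fairness toward rare classes that the abstract describes, although the paper's actual definition may differ.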
Key words
text categorization /
imbalanced problem /
feature selection /
dominance analysis /
discriminative ability