Abstract
With the rapid development of the Internet, sentiment classification has attracted considerable attention from researchers in natural language processing. This paper focuses on sentiment classification tasks in which the data distribution is imbalanced (referred to as imbalanced sentiment classification). To reduce the high-dimensional feature space in this setting, we compare four classic feature selection (FS) methods that have been widely studied in traditional text categorization. We further propose three feature selection modes and compare them experimentally in terms of classification and dimensionality-reduction performance. The experimental results show that feature selection can significantly reduce the dimension of the feature vector without loss of classification performance. In addition, information gain (IG) combined with the mode "feature selection after random under-sampling" achieves the best classification performance.
Key words: sentiment classification; imbalanced data; feature selection
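The abstract names the best-performing combination but does not give its details, so the following is only a minimal sketch of what "feature selection after random under-sampling" with information gain might look like. It assumes documents are represented as sets of tokens, that IG is computed on binary term presence, and that under-sampling shrinks every class to the size of the smallest one. All function names are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a label distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document' (assumed definition)."""
    with_term = Counter(y for d, y in zip(docs, labels) if term in d)
    without_term = Counter(y for d, y in zip(docs, labels) if term not in d)
    n = len(docs)
    h_prior = entropy(list(Counter(labels).values()))
    h_cond = 0.0
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:
            h_cond += size / n * entropy(list(part.values()))
    return h_prior - h_cond


def random_under_sample(docs, labels, seed=0):
    """Randomly drop samples so every class keeps as many as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for d, y in zip(docs, labels):
        by_class.setdefault(y, []).append(d)
    k = min(len(v) for v in by_class.values())
    sampled = [(d, y) for y, ds in by_class.items() for d in rng.sample(ds, k)]
    rng.shuffle(sampled)
    return [d for d, _ in sampled], [y for _, y in sampled]


def select_features_after_under_sampling(docs, labels, top_k, seed=0):
    """Balance the training set first, then rank terms by IG on the balanced set."""
    bal_docs, bal_labels = random_under_sample(docs, labels, seed)
    vocab = sorted(set().union(*bal_docs))
    ranked = sorted(
        vocab,
        key=lambda t: information_gain(bal_docs, bal_labels, t),
        reverse=True,
    )
    return ranked[:top_k], bal_docs, bal_labels


# Toy imbalanced corpus (made up for illustration): 4 positive, 2 negative reviews.
toy_docs = [
    {"great", "phone"}, {"great", "screen"}, {"great", "battery"},
    {"great", "camera"}, {"bad", "phone"}, {"bad", "screen"},
]
toy_labels = ["pos"] * 4 + ["neg"] * 2
top_terms, bal_docs, bal_labels = select_features_after_under_sampling(
    toy_docs, toy_labels, top_k=2
)
```

In this sketch the ranking is computed on the balanced subset only, so the majority class does not dominate the term statistics. Whether the paper's mode is implemented in exactly this way is an assumption.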
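Funding
Supported by the National Natural Science Foundation of China (61070123, 61003155) and the Open Project of the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences.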