Feature Selection Method for Semi-Supervised Sentiment Classification
WANG Zhihao, WANG Zhongqing, LI Shoushan, LI Peifeng, SHI Hanxiao
School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; School of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, Zhejiang 310018, China
Abstract: Feature selection reduces a high-dimensional feature space in order to simplify the learning problem and improve the learning method. Existing studies have shown that feature selection effectively reduces the feature space in sentiment classification. Unlike previous studies, this paper investigates feature selection for semi-supervised sentiment classification and proposes a novel feature selection method based on a bipartite graph. First, we model the relations between documents and words as a bipartite graph. Second, using a small amount of labeled data together with the bipartite graph, we apply a label propagation algorithm to estimate the probability that each feature belongs to each sentiment category. Third, we select features according to these sentiment probabilities. Experimental results across multiple domains demonstrate that our method performs much better than random feature selection and can significantly reduce the dimension of the feature vector without any loss in classification performance.
Key words: sentiment classification; semi-supervised learning; label propagation; bipartite graph; feature selection
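The three steps of the abstract (bipartite graph, label propagation, selection by sentiment probability) can be illustrated with a minimal sketch. The abstract does not give the propagation formula, so the update rule below (alternating weighted averages between documents and words, with labeled documents clamped) is an assumption, as are the function name `select_features`, its parameters, and the ranking criterion |P(positive | word) - 0.5|. It is not the authors' exact algorithm.

```python
import numpy as np

def select_features(X, labels, n_iter=20, top_k=2):
    """Hypothetical sketch of bipartite-graph label propagation for feature selection.

    X:      (n_docs, n_words) non-negative document-word count matrix;
            its non-zero entries are the edges of the bipartite graph.
    labels: length n_docs; 1 = positive, 0 = negative, -1 = unlabeled.
    Returns (indices of the top_k selected words, P(positive | word) per word).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    labeled = labels >= 0

    # Row-normalized transition matrices for both directions of the graph.
    doc_to_word = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
    word_to_doc = X.T / np.maximum(X.T.sum(axis=1, keepdims=True), 1e-12)

    # P(positive | document): known for labeled documents, 0.5 otherwise.
    doc_scores = np.where(labeled, labels, 0.5).astype(float)

    for _ in range(n_iter):
        # A word's score is the weighted average of its documents' scores.
        word_scores = word_to_doc @ doc_scores
        # A document's score is the weighted average of its words' scores.
        doc_scores = doc_to_word @ word_scores
        # Clamp labeled documents back to their known labels (assumed rule).
        doc_scores[labeled] = labels[labeled]

    word_scores = word_to_doc @ doc_scores
    # Rank words by how far their sentiment probability is from neutral (0.5).
    ranking = np.argsort(-np.abs(word_scores - 0.5), kind="stable")
    return ranking[:top_k], word_scores

# Toy corpus; columns are the words: good, bad, the, great, awful.
X = [[1, 0, 1, 0, 0],   # labeled positive
     [0, 1, 1, 0, 0],   # labeled negative
     [1, 0, 1, 1, 0],   # unlabeled
     [0, 1, 1, 0, 1]]   # unlabeled
selected, scores = select_features(X, [1, 0, -1, -1], n_iter=50, top_k=4)
print(selected, scores)
```

In this toy corpus the word "the" appears in every document, so its propagated score stays near 0.5 and it is the one word dropped when four of the five are kept. The words "great" and "awful" occur only in unlabeled documents, yet they still receive non-neutral scores through the graph.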