Abstract
With the rise of Web 2.0, micro blogs have become an important platform for sharing information and a major channel through which people receive and spread it. Sentiment classification of micro blog posts has accordingly attracted growing research attention in recent years. This paper analyzes in depth how traditional sentiment text classification and micro blog sentiment classification differ in feature representation and feature selection. To address the shortcomings of existing micro blog sentiment classification methods in how features are selected and used, we propose three simple but effective approaches to selecting and adding features: lexicalized hashtag (topic) features, sentiment word content features, and probabilistic sentiment word polarity features. Experimental results show that the proposed feature selection and feature addition methods raise micro blog sentiment classification accuracy from 73.17% with the traditional method to 84.17%, a significant improvement in micro blog sentiment analysis performance.
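As a rough illustration of the three feature types named in the abstract, the sketch below builds a feature dictionary for a single post. It is not the authors' implementation: the toy lexicon, the `#topic#` hashtag pattern, the function names, the smoothing scheme, and the choice to average P(positive | word) over matched words are all assumptions made for this example, and the post is assumed to be word-segmented already.

```python
import re
from collections import Counter

# Hypothetical toy sentiment lexicon, for illustration only.
SENTIMENT_WORDS = {"喜欢", "开心", "失望", "讨厌"}

# Assumption: topics in Chinese micro blog posts are written as #topic#.
HASHTAG_RE = re.compile(r"#([^#]+)#")


def polarity_probabilities(labeled_posts, lexicon, smoothing=1.0):
    """Estimate P(positive | word) for each lexicon word.

    labeled_posts: iterable of (tokens, label), label in {"pos", "neg"}.
    Smoothing keeps words never seen in training at 0.5.
    """
    pos, neg = Counter(), Counter()
    for tokens, label in labeled_posts:
        for word in set(tokens) & lexicon:
            (pos if label == "pos" else neg)[word] += 1
    return {
        w: (pos[w] + smoothing) / (pos[w] + neg[w] + 2 * smoothing)
        for w in lexicon
    }


def extract_features(text, tokens, lexicon, polarity):
    """Build a sparse feature dict for one post (hypothetical feature names)."""
    features = {}
    # 1. Lexicalized hashtag features: one indicator per topic string.
    for topic in HASHTAG_RE.findall(text):
        features[f"hashtag={topic}"] = 1.0
    # 2. Sentiment word content features: indicator per lexicon word present.
    hits = [t for t in tokens if t in lexicon]
    for word in hits:
        features[f"senti_word={word}"] = 1.0
    # 3. Probabilistic polarity feature: mean P(positive | word) over hits.
    if hits:
        features["senti_prob_pos"] = sum(polarity[w] for w in hits) / len(hits)
    return features


if __name__ == "__main__":
    train = [(["我", "喜欢"], "pos"), (["喜欢", "开心"], "pos"), (["失望"], "neg")]
    polarity = polarity_probabilities(train, SENTIMENT_WORDS)
    print(extract_features("#世界杯#我喜欢这场比赛",
                           ["我", "喜欢", "这场", "比赛"],
                           SENTIMENT_WORDS, polarity))
```

A dictionary of this kind could then be vectorized and passed to any standard classifier; which classifier and feature weighting the paper actually uses is not stated in the abstract.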
Key words
Chinese micro blog /
sentiment analysis /
machine learning /
feature selection