Abstract: Microblogging sites are among the most popular information-sharing platforms today. However, among the large number of posts published every day, spam is everywhere: users publish spam posts to advertise, broadcast, and boast about their own products, and to defame their competitors. Filtering spam tweets is therefore a critical and fundamental problem. In this paper, we propose a semi-supervised algorithm based on Expectation Maximization and a Naive Bayes classifier (EM-NB), which filters spam tweets effectively using only a small amount of labeled data. Experimental results on more than 140 thousand tweets from Sina Weibo show that our method achieves higher accuracy and F-score than the baselines.
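To make the EM-NB idea concrete, the following is a minimal sketch of the textbook combination of Expectation Maximization with a multinomial Naive Bayes classifier. It is not the authors' implementation: the function names (`train_nb`, `posterior_spam`, `em_nb`), the Laplace smoothing constant `alpha`, the fixed iteration count, and the toy English tokens are all assumptions made for illustration. A classifier is first fit on the few labeled posts; each iteration then estimates soft spam probabilities for the unlabeled posts (E-step) and refits the classifier on labeled and unlabeled posts together (M-step).

```python
import math
from collections import Counter

def train_nb(docs, weights, vocab, alpha=1.0):
    """Fit multinomial NB from soft labels; weights[i] = P(spam | doc i)."""
    prior = [alpha, alpha]            # smoothed class mass: 0 = ham, 1 = spam
    counts = [Counter(), Counter()]   # fractional token counts per class
    for tokens, p_spam in zip(docs, weights):
        for c, w in ((0, 1.0 - p_spam), (1, p_spam)):
            prior[c] += w
            for t in tokens:
                counts[c][t] += w
    total = sum(prior)
    log_prior = [math.log(p / total) for p in prior]
    log_like = []
    for c in (0, 1):
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_like.append({t: math.log((counts[c][t] + alpha) / denom) for t in vocab})
    return log_prior, log_like

def posterior_spam(tokens, model):
    """Return P(spam | tokens); tokens outside the vocabulary are ignored."""
    log_prior, log_like = model
    scores = [log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
              for c in (0, 1)]
    m = max(scores)                   # subtract the max for numerical stability
    e = [math.exp(s - m) for s in scores]
    return e[1] / (e[0] + e[1])

def em_nb(labeled, unlabeled, n_iter=10):
    """labeled: list of (tokens, label) with label 1 = spam; unlabeled: list of tokens."""
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}
    lab_docs = [d for d, _ in labeled]
    lab_w = [float(y) for _, y in labeled]
    model = train_nb(lab_docs, lab_w, vocab)          # initialize from labeled data only
    for _ in range(n_iter):
        unl_w = [posterior_spam(d, model) for d in unlabeled]          # E-step
        model = train_nb(lab_docs + unlabeled, lab_w + unl_w, vocab)   # M-step
    return model

# Hypothetical toy data: two labeled posts and four unlabeled ones.
labeled = [(["buy", "cheap", "discount"], 1), (["lunch", "with", "friends"], 0)]
unlabeled = [["cheap", "discount", "sale"], ["sale", "buy", "now"],
             ["friends", "movie", "tonight"], ["movie", "lunch", "today"]]
model = em_nb(labeled, unlabeled)
print(posterior_spam(["sale", "now"], model))
print(posterior_spam(["movie", "tonight"], model))
```

In this toy run, the words "sale" and "now" never occur in a labeled post, so any evidence the model holds about them comes from the unlabeled posts. That is the mechanism by which a small labeled set can be stretched over a large unlabeled corpus.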