Abstract
Microblogging sites are among the most active information-sharing platforms today, yet they are flooded with spam: among the large number of posts published every day, users post spam to advertise, broadcast, and boast about their own products, and to defame their competitors. Given the microblog data collected for a topic, filtering out spam posts that are irrelevant to the topic while keeping the topic-relevant ones is therefore a pressing and fundamental problem. In this paper, we propose a semi-supervised filtering method for Chinese microblogs based on a Naive Bayesian classifier and the Expectation Maximization algorithm (EM-NB), whose advantage is that it achieves good filtering performance using only a small amount of labeled data. Experiments on more than 140,000 Sina Weibo posts covering ten topics show that the proposed model achieves higher accuracy and F-score than the Naive Bayes and support vector machine baselines.
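The general idea of combining Naive Bayes with Expectation Maximization can be sketched as follows. This is a minimal illustration only, not the paper's implementation: it assumes a two-class multinomial Naive Bayes with Laplace smoothing, tokenized posts as input, and a fixed number of EM iterations. The function names (`train_nb`, `posterior`, `em_nb`) and the parameters `alpha` and `n_iter` are hypothetical, and the features and settings actually used in the paper are not given in the abstract.

```python
import math
from collections import Counter


def train_nb(docs, weights, vocab, alpha=1.0):
    """M-step: fit a multinomial Naive Bayes model from soft class weights.

    docs    : list of token lists
    weights : list of [p_relevant, p_spam], one pair per document
    alpha   : Laplace smoothing constant (an assumption of this sketch)
    """
    class_mass = [0.0, 0.0]
    word_counts = [Counter(), Counter()]
    for tokens, w in zip(docs, weights):
        for c in range(2):
            class_mass[c] += w[c]
            for t in tokens:
                word_counts[c][t] += w[c]
    total = sum(class_mass)
    log_prior = [math.log((class_mass[c] + alpha) / (total + 2 * alpha))
                 for c in range(2)]
    log_like = []
    for c in range(2):
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like.append({t: math.log((word_counts[c][t] + alpha) / denom)
                         for t in vocab})
    return log_prior, log_like


def posterior(tokens, log_prior, log_like):
    """E-step for one document: return [P(relevant | doc), P(spam | doc)]."""
    scores = []
    for c in range(2):
        # Tokens outside the training vocabulary are ignored.
        scores.append(log_prior[c] +
                      sum(log_like[c][t] for t in tokens if t in log_like[c]))
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def em_nb(labeled, unlabeled, n_iter=10):
    """Semi-supervised training.

    labeled   : list of (tokens, label), label 0 = relevant, 1 = spam
    unlabeled : list of token lists
    """
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    lab_docs = [tokens for tokens, _ in labeled]
    lab_w = [[1.0, 0.0] if y == 0 else [0.0, 1.0] for _, y in labeled]
    # Initialise from the small labeled set only.
    model = train_nb(lab_docs, lab_w, vocab)
    for _ in range(n_iter):
        # E-step: soft-label the unlabeled posts with the current model.
        unl_w = [posterior(tokens, *model) for tokens in unlabeled]
        # M-step: refit on labeled (hard) plus unlabeled (soft) posts.
        model = train_nb(lab_docs + unlabeled, lab_w + unl_w, vocab)
    return model
```

In this sketch the labeled posts keep their hard labels in every iteration, while the unlabeled posts contribute fractional counts that are re-estimated each round; this is how a small labeled set can be stretched over a much larger unlabeled collection.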
Key words
spam tweet filtering /
naive Bayesian classifier /
expectation maximization (EM) /
semi-supervised learning
Funding
National Natural Science Foundation of China (61332007, 61272227)