Abstract
Microblogs have changed the way people obtain information. However, they have been infiltrated by large amounts of spam, which threatens their healthy development. This paper studies spam user detection in Chinese microblogs. We first analyze the behavior of spam users and propose 7 detection features in 3 categories, based on the user graph, user profiles, and microblog content. We then describe a spam user detection method based on an SVM classifier. Finally, we annotate a collected microblog dataset and evaluate the classifier on it. The experiments show that the classifier achieves high accuracy and recall, and that the proposed features discriminate well between spam users and normal users.
Key words
microblogs /
spam users /
detection
Funding
National Natural Science Foundation of China (61100083); National 863 Program (2012AA011003); National 242 Special Project (2011F45, 2011F65)