A Semi-supervised Method for Filtering Chinese Spam Tweets

YAO Ziyu, TU Shouzhong, HUANG Minlie, ZHU Xiaoyan

Journal of Chinese Information Processing, 2016, Vol. 30, Issue (5): 176-186.
Review

Abstract

Microblogging sites are among the most active information-sharing platforms today. However, among the large number of posts published every day, spam is everywhere: users exploit spam posts to advertise and promote their own products and to defame their competitors. Given the microblog data collected for a topic, filtering out the spam posts that are irrelevant to the topic while retaining the relevant ones is therefore an urgent problem. In this paper, we propose a semi-supervised filtering method for Chinese microblogs based on the Naive Bayesian classifier and the Expectation Maximization algorithm (EM-NB), whose advantage is that it achieves satisfactory filtering performance using only a small amount of labeled data. Experiments on more than 140,000 Sina Weibo posts covering ten topics, each filtered separately, show that the proposed model achieves higher accuracy and F-score than the Naive Bayes and support vector machine baselines.
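The combination of Expectation Maximization with a Naive Bayes classifier described in the abstract can be sketched roughly as follows. This is a minimal illustration of the generic EM plus multinomial Naive Bayes scheme, not the authors' implementation: the abstract does not specify the features, smoothing, weighting of unlabeled data, or stopping rule, so the bag-of-words counts, Laplace smoothing, toy vocabulary, and all function names (`fit_nb`, `posterior`, `em_nb`) are assumptions made for this example.

```python
import numpy as np

def fit_nb(X, resp, alpha=1.0):
    """M-step: estimate log priors and log word likelihoods from (soft) labels.
    X is an (n_docs, n_words) count matrix; resp is (n_docs, n_classes)."""
    class_mass = resp.sum(axis=0)  # expected number of documents per class
    log_prior = np.log((class_mass + alpha) /
                       (class_mass.sum() + alpha * resp.shape[1]))
    word_counts = resp.T @ X + alpha  # Laplace-smoothed expected word counts
    log_lik = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_lik

def posterior(X, log_prior, log_lik):
    """E-step: P(class | document) under the current model."""
    joint = X @ log_lik.T + log_prior
    joint = joint - joint.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(joint)
    return p / p.sum(axis=1, keepdims=True)

def em_nb(X_lab, y_lab, X_unl, n_classes=2, n_iter=20, tol=1e-6):
    """Train on a few labeled documents, then refine with unlabeled ones."""
    hard = np.eye(n_classes)[np.asarray(y_lab)]  # labeled docs keep their labels
    log_prior, log_lik = fit_nb(X_lab, hard)     # initial model: labeled data only
    X_all = np.vstack([X_lab, X_unl])
    prev = None
    for _ in range(n_iter):
        soft = posterior(X_unl, log_prior, log_lik)              # E-step
        log_prior, log_lik = fit_nb(X_all, np.vstack([hard, soft]))  # M-step
        if prev is not None and np.abs(soft - prev).max() < tol:
            break  # posteriors on unlabeled data have stopped changing
        prev = soft
    return log_prior, log_lik

# Hypothetical toy vocabulary: discount, buy, click, match, goal, team.
# Class 0 = topic-relevant post, class 1 = spam (assumed encoding).
X_lab = np.array([[0, 0, 0, 2, 1, 1],
                  [2, 1, 1, 0, 0, 0]], dtype=float)
y_lab = [0, 1]
X_unl = np.array([[0, 0, 0, 1, 2, 0],
                  [1, 2, 1, 0, 0, 0],
                  [0, 0, 0, 0, 1, 2],
                  [1, 0, 2, 0, 0, 0]], dtype=float)
X_test = np.array([[0, 0, 0, 1, 1, 1],
                   [1, 1, 1, 0, 0, 0]], dtype=float)

log_prior, log_lik = em_nb(X_lab, y_lab, X_unl)
probs = posterior(X_test, log_prior, log_lik)
pred = probs.argmax(axis=1)
print(pred)
```

In this sketch the labeled documents keep their hard labels in every M-step, while the unlabeled documents contribute fractional counts weighted by their current posteriors. That is one common design choice for this family of methods; whether the paper follows it is not stated in the abstract.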

Key words

spam tweet filtering / semi-supervised learning / expectation maximization (EM) algorithm / naive Bayesian classifier

Cite this article

YAO Ziyu, TU Shouzhong, HUANG Minlie, ZHU Xiaoyan. A Semi-supervised Method for Filtering Chinese Spam Tweets. Journal of Chinese Information Processing. 2016, 30(5): 176-186

Funding

National Natural Science Foundation of China (61332007, 61272227)