A Semi-supervised Method for Filtering Chinese Spam Tweets

Journal of Chinese Information Processing, 2016, Vol. 30, Issue 5: 176-186.

Survey

YAO Ziyu, TU Shouzhong, HUANG Minlie, ZHU Xiaoyan

Abstract

Microblogging sites are among the most active information-sharing platforms today, yet they are flooded with spam: among the large number of posts published every day, users exploit spam posts to advertise, broadcast, promote their own products, and defame their competitors. Given the microblog data collected for a specific topic, filtering out topic-irrelevant spam posts while retaining topic-relevant ones is therefore a critical and fundamental problem. In this paper, we propose a semi-supervised filtering method for Chinese microblogs based on the Naive Bayes classifier and the Expectation Maximization algorithm (EM-NB), whose advantage is that it achieves good filtering performance using only a small amount of labeled data. In experiments on more than 140,000 Sina Weibo posts covering ten topics, the proposed model achieves higher accuracy and F-score than the Naive Bayes and support vector machine baselines.
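The general idea of combining Naive Bayes with Expectation Maximization can be sketched as follows. This is a generic illustration of semi-supervised EM-NB, not the authors' implementation: the bag-of-words features, Laplace smoothing (`alpha`), fixed iteration count, function names, and toy English tokens are all assumptions made for the example, and real Chinese posts would first need word segmentation. A classifier is trained on the few labeled posts, then the loop alternates between soft-labeling the unlabeled posts (E-step) and re-estimating the model from all posts (M-step).

```python
import math
from collections import Counter

def train_nb(docs, weights, vocab, alpha=1.0):
    """M-step: estimate log priors and log word likelihoods from soft labels.
    docs: list of token lists; weights: [P(relevant), P(spam)] per document."""
    class_mass = [alpha, alpha]              # smoothed class counts
    word_mass = [Counter(), Counter()]       # soft word counts per class
    for tokens, w in zip(docs, weights):
        for c in range(2):
            class_mass[c] += w[c]
            for t in tokens:
                word_mass[c][t] += w[c]
    total = sum(class_mass)
    log_prior = [math.log(m / total) for m in class_mass]
    log_like = []
    for c in range(2):
        denom = sum(word_mass[c].values()) + alpha * len(vocab)
        log_like.append({t: math.log((word_mass[c][t] + alpha) / denom)
                         for t in vocab})
    return log_prior, log_like

def posterior(tokens, log_prior, log_like):
    """E-step for one document: P(class | tokens) under the current model."""
    scores = [log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
              for c in range(2)]
    top = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def em_nb(labeled, unlabeled, n_iter=10):
    """labeled: list of (tokens, label), label 0 = topic-relevant, 1 = spam."""
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    l_docs = [tokens for tokens, _ in labeled]
    l_weights = [[1.0, 0.0] if y == 0 else [0.0, 1.0] for _, y in labeled]
    model = train_nb(l_docs, l_weights, vocab)   # initialize from labeled data only
    for _ in range(n_iter):
        u_weights = [posterior(tokens, *model) for tokens in unlabeled]
        model = train_nb(l_docs + unlabeled, l_weights + u_weights, vocab)
    return model

# Toy data (hypothetical): two labeled posts, four unlabeled posts.
labeled = [(["buy", "cheap", "discount"], 1), (["match", "goal", "team"], 0)]
unlabeled = [["cheap", "discount", "shop"], ["team", "goal", "win"],
             ["buy", "shop", "discount"], ["match", "win", "team"]]
model = em_nb(labeled, unlabeled)
p_spam_like = posterior(["shop", "discount"], *model)
p_topic_like = posterior(["win", "team"], *model)
```

In this toy run the word "shop" never appears in a labeled post, yet the model can still associate it with spam because it co-occurs with labeled spam words in the unlabeled posts. That is the sense in which unlabeled data can compensate for a small labeled set.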

Cite this article

YAO Ziyu, TU Shouzhong, HUANG Minlie, ZHU Xiaoyan. A Semi-supervised Method for Filtering Chinese Spam Tweets. Journal of Chinese Information Processing. 2016, 30(5): 176-186

Funding

National Natural Science Foundation of China (61332007, 61272227)

Received: 2015-09-21
Published: 2016-10-15
Issue date: 2016-10-15
