The Application of Resampling in Recognition of Microblog Robots
LUO Yunsong1, HUANG Muyu2, JIA Tao1
1. School of Computer and Information Science, Southwest University, Chongqing 400715, China;
2. Chang'an Automobile Finance, Chongqing 400020, China
Abstract As the number of microblog robot accounts grows, identifying them has become a prominent problem in data mining. To address the imbalanced data issue in this task, we use a large data set to study how resampling affects supervised learning algorithms and propose a novel microblog robot recognition framework that incorporates resampling. Several metrics are used to evaluate 7 supervised learning algorithms on imbalanced validation sets under 5 different resampling methods. The experimental results show that the Recall of a model trained on a small balanced training set drops sharply in real situations, whereas the framework combined with resampling significantly improves the recognition of robot accounts. The NearMiss undersampling method increases Recall, while the ADASYN oversampling method improves the G_mean measure. In general, release time, publishing region, and release interval are important features for distinguishing normal users from robots. Resampling also changes the ranking of the features that a machine learning algorithm relies on, which allows the model to achieve better performance.
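
To make the two evaluation measures and the undersampling idea concrete, the following is a minimal sketch and not the implementation used in this work. It assumes the usual definition of G_mean as the geometric mean of Recall and specificity, treats the robot class as the positive label, and uses a NearMiss-1 style rule (keep the majority samples closest on average to their k nearest minority samples). The function names, the toy points, and the choice k=2 are all hypothetical.

```python
import math


def recall_and_g_mean(y_true, y_pred, positive=1):
    """Return (recall, g_mean) for a binary task.

    Assumption: G_mean = sqrt(recall * specificity), and `positive`
    marks the class of interest (here, the robot accounts).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, math.sqrt(recall * specificity)


def nearmiss1_undersample(majority, minority, k=3):
    """NearMiss-1 style undersampling (illustrative only).

    Keeps len(minority) majority points whose mean Euclidean distance
    to their k nearest minority points is smallest.
    """
    def score(point):
        nearest = sorted(math.dist(point, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)

    return sorted(majority, key=score)[:len(minority)]


# Toy validation set: 2 positives among 10 samples (hypothetical data).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10                      # accuracy 0.8, yet no positive is found
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]     # 1 true positive, 1 false positive

print(recall_and_g_mean(y_true, all_negative))
print(recall_and_g_mean(y_true, one_hit))

# Toy feature vectors: the two majority points near the minority class are kept.
majority = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0)]
minority = [(0.0, 1.0), (1.0, 0.0)]
print(nearmiss1_undersample(majority, minority, k=2))
```

The first toy prediction shows why accuracy alone is misleading on an imbalanced validation set: a classifier that labels every account as normal is right 80% of the time, but both its Recall and its G_mean are zero.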