Identification of Product Review Spam by Random Forest

HE Long

Journal of Chinese Information Processing, 2015, Vol. 29, Issue 3: 150-154.
Sentiment Analysis and Social Computing

Abstract

Existing methods for identifying product review spam concentrate on feature selection and ignore the class imbalance of review data sets. This paper therefore proposes a random-forest-based method for identifying product review spam: the random forest model is built either by repeatedly drawing, with replacement, the same number of samples from the majority and minority classes, or by assigning the same weight to the majority and minority classes as a whole. Experimental results on an Amazon data set show that the random forest method outperforms the other baseline methods.
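The two balancing strategies named in the abstract can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the function names, the toy labels, the choice to draw as many samples per class as the minority class contains, and the reading of "same weight" as equal total weight per class are all assumptions made here. Training the trees themselves is left out.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(labels, rng):
    """Return sample indices for one tree: draw, with replacement, the same
    number of samples from every class.

    Assumption: the per-class draw size is the minority-class size.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_draw = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        # random.choices samples with replacement
        sample.extend(rng.choices(indices, k=n_draw))
    return sample


def equal_total_class_weights(labels):
    """Return a per-sample weight for each class such that every class
    carries the same total weight (each class's weights sum to 1.0)."""
    counts = Counter(labels)
    return {label: 1.0 / count for label, count in counts.items()}


# Toy data (hypothetical): 1 = spam (minority), 0 = non-spam (majority).
labels = [1] * 5 + [0] * 95
rng = random.Random(0)

# Strategy 1: one balanced bootstrap sample per tree (three trees here).
samples = [balanced_bootstrap(labels, rng) for _ in range(3)]

# Strategy 2: keep all samples but reweight them by class.
weights = equal_total_class_weights(labels)
```

In the first strategy each tree would be trained on its own balanced sample; in the second, every tree would see the full data with the minority-class samples weighted more heavily.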

Key words

product review spam / imbalance problem / random forest

Cite this article

HE Long. Identification of Product Review Spam by Random Forest. Journal of Chinese Information Processing. 2015, 29(3): 150-154

Funding

Natural Science Foundation of Fujian Province (2010J05133)