High Performance Chinese Spam Filter

QI Haoliang1, CHENG Xiaolong1, YANG Muyun2, HE Xiaoning3, LI Sheng2, LEI Guohua1

Journal of Chinese Information Processing ›› 2010, Vol. 24 ›› Issue (2): 76-84.
Review


Abstract

We designed and implemented a high-performance Chinese spam filter that works in online filtering mode, which lets it keep recognizing spam as the spam changes over time. The filter is built on a logistic regression model: email features are extracted as byte-level n-grams, and the filter is trained with the TONE (Train On or Near Error) method. Experiments on several large-scale public Chinese spam filtering evaluation corpora show that the filter outperforms the best result of that year's evaluation on the TREC 06C data, reaches a 1-ROCA of 0.000 0% on the SEWM 07 immediate feedback task, and clearly outperforms all other methods in the SEWM 08 online filtering tasks.
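
The combination named in the abstract (byte-level n-gram features, a logistic regression scorer, and TONE training) can be illustrated with a minimal sketch. It assumes binary n-gram features, a plain gradient step, and a decision threshold of 0.5. The class name `OnlineLogisticFilter`, the n-gram length of 4, the learning rate, and the TONE margin are illustrative assumptions, not values taken from the paper.

```python
import math
from collections import defaultdict


def byte_ngrams(message: bytes, n: int = 4) -> set:
    """Return the distinct overlapping n-byte substrings of the raw message."""
    return {message[i:i + n] for i in range(len(message) - n + 1)}


class OnlineLogisticFilter:
    """Hypothetical online filter: logistic regression over byte n-grams."""

    def __init__(self, n: int = 4, rate: float = 0.1, margin: float = 0.2):
        self.n = n                # n-gram length in bytes (assumed value)
        self.rate = rate          # learning rate (assumed value)
        self.margin = margin      # TONE "near error" band around 0.5 (assumed value)
        self.weights = defaultdict(float)

    def score(self, message: bytes) -> float:
        """Estimated probability that the message is spam."""
        z = sum(self.weights.get(g, 0.0) for g in byte_ngrams(message, self.n))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, message: bytes, is_spam: bool) -> bool:
        """TONE: update only on an error or a near-threshold score.

        Returns True if the weights were updated.
        """
        p = self.score(message)
        wrong = (p > 0.5) != is_spam
        near = abs(p - 0.5) < self.margin
        if not (wrong or near):
            return False
        step = self.rate * ((1.0 if is_spam else 0.0) - p)
        for g in byte_ngrams(message, self.n):
            self.weights[g] += step
        return True


# Online use: score each message as it arrives, then learn from its label.
spam = "免费发票 代开".encode("utf-8")
ham = "明天下午开会".encode("utf-8")
flt = OnlineLogisticFilter()
for _ in range(20):
    for msg, label in ((spam, True), (ham, False)):
        flt.score(msg)
        flt.train(msg, label)
```

Because the features are raw bytes, the sketch needs no word segmentation or character-set decoding, and because TONE skips messages that are already classified confidently and correctly, most of the stream costs only a scoring pass.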

Key words

computer application / Chinese information processing / Chinese spam filtering / online learning / logistic regression model / byte-level n-gram / TONE

Cite this article

QI Haoliang, CHENG Xiaolong, YANG Muyun, HE Xiaoning, LI Sheng, LEI Guohua. High Performance Chinese Spam Filter. Journal of Chinese Information Processing. 2010, 24(2): 76-84


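Funding

Key Program of the National Natural Science Foundation of China (60736044); National Natural Science Foundation of China (60873105); Heilongjiang Province Science and Technology Research Program (GZ07A108)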