Abstract
We designed and implemented a high-performance Chinese spam filter that operates in online filtering mode, which allows it to keep up with continually evolving spam. The filter is built on a logistic regression model, extracts email features with byte-level n-grams, and is trained with the TONE (Train On or Near Error) method. Experiments on several large-scale public Chinese spam filtering evaluation corpora show that the filter outperforms the best system of that year's evaluation on the TREC 06C data, achieves a 1-ROCA of 0.000 0% on the SEWM 07 immediate feedback task, and clearly outperforms all other methods in the SEWM 08 online filtering tasks.
Key words: computer application; Chinese information processing; Chinese spam filtering; online learning; logistic regression model; byte-level n-gram; TONE
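The three components named in the abstract (byte-level n-gram features, a logistic regression model, and TONE training) can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the n-gram length of 4, the 3000-byte prefix limit, the learning rate, and the confidence margin are all assumptions chosen for illustration, since the abstract does not state them.

```python
import math
from collections import defaultdict


def byte_ngrams(message: bytes, n: int = 4, max_bytes: int = 3000):
    """Return the set of distinct byte-level n-grams in the message prefix.

    Working on raw bytes avoids Chinese word segmentation and charset
    decoding. The values of n and max_bytes are illustrative assumptions.
    """
    data = message[:max_bytes]
    return {data[i:i + n] for i in range(len(data) - n + 1)}


class ToneLogisticFilter:
    """Online logistic regression over binary byte n-gram features (sketch)."""

    def __init__(self, learning_rate: float = 0.01, margin: float = 0.25):
        self.weights = defaultdict(float)  # one weight per observed n-gram
        self.learning_rate = learning_rate  # assumed value
        self.margin = margin  # assumed half-width of the "near error" band

    def score(self, features) -> float:
        """Estimated probability that the message is spam."""
        z = sum(self.weights.get(f, 0.0) for f in features)
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, features, is_spam: bool) -> bool:
        """TONE update: learn only from errors or low-confidence predictions.

        Returns True if the weights were updated.
        """
        p = self.score(features)
        wrong = (p > 0.5) != is_spam          # misclassified message
        near = abs(p - 0.5) < self.margin     # correct but not confident
        if not (wrong or near):
            return False
        gradient = (1.0 if is_spam else 0.0) - p  # log-loss gradient step
        for f in features:
            self.weights[f] += self.learning_rate * gradient
        return True
```

In an online setting with immediate feedback, each incoming message would first be scored with `score(byte_ngrams(raw))` and then passed to `train` once its true label arrives. Skipping updates on confidently correct messages is what distinguishes TONE from training on every message.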
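Funding
Key Program of the National Natural Science Foundation of China (60736044); National Natural Science Foundation of China (60873105); Science and Technology Research Program of Heilongjiang Province (GZ07A108)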