Abstract
Spam filtering shares key characteristics with stream data processing: very large volume, unbounded growth, and dynamic change. Traditional spam filtering methods rely on static text feature extraction, which cannot capture how the features of stream data change over time. This paper proposes a spam filtering method that uses the characteristics of the time stream to adjust the effective features used for filtering in real time. Experimental results on the TREC Spam Track corpus show that the method reduces the time and space cost of the filtering computation while keeping filtering accuracy high.
Key words
computer application /
Chinese information processing /
spam /
stream data /
time stream /
text classification /
feature selection
Funding
Supported by the National Natural Science Foundation of China (60305006)