Abstract
Spam is one of the most pressing problems on the Internet, and many spam filtering techniques are already in use. Methods based on partial least squares (PLS) can handle the data sparsity, high feature dimensionality, and multicollinearity that are common in e-mail content. However, the intrinsic relationships among e-mail contents are often nonlinear. This paper introduces a kernel function into the PLS method to address such nonlinear problems. Experiments on the Enron-Spam dataset show that, compared with PLSR and other methods, the resulting kernel PLS (KPLS) model achieves better filtering performance.
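The idea of adding a kernel function to PLS can be illustrated with a minimal sketch. It follows the NIPALS-style kernel PLS regression formulation of Rosipal and Trejo (reference [8]) for a single response variable. The RBF kernel, the `gamma` value, the number of components, and the two-ring toy data are all assumptions made for illustration; the abstract does not state the kernel, the parameters, or the e-mail features actually used, and the function names here are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpls_fit(K, y, n_components=2):
    """Kernel PLS regression for one response; returns dual weights and the y mean."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = J @ K @ J                               # kernel centered in feature space
    y_mean = y.mean()
    yc = (y - y_mean).reshape(-1, 1)
    Kd, yd = Kc.copy(), yc.copy()                # deflated copies
    T = np.zeros((n, n_components))              # score vectors t
    U = np.zeros((n, n_components))              # score vectors u
    for i in range(n_components):
        u = yd / np.linalg.norm(yd)              # single response: u is the deflated y
        t = Kd @ u
        t = t / np.linalg.norm(t)
        T[:, [i]] = t
        U[:, [i]] = u
        P = np.eye(n) - t @ t.T
        Kd = P @ Kd @ P                          # deflate the kernel matrix
        yd = yd - t @ (t.T @ yd)                 # deflate the response
    alpha = U @ np.linalg.solve(T.T @ Kc @ U, T.T @ yc)
    return alpha, y_mean

def kpls_predict(K_new, K_train, alpha, y_mean):
    """Predict from the kernel between new samples (rows) and training samples."""
    n = K_train.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    K_new_c = (K_new - np.ones((K_new.shape[0], n)) @ K_train / n) @ J
    return (K_new_c @ alpha).ravel() + y_mean

def two_rings(n_per_class, rng):
    """Toy data that is not linearly separable: label +1 inside, -1 outside."""
    angles = rng.uniform(0.0, 2.0 * np.pi, 2 * n_per_class)
    radii = np.r_[np.full(n_per_class, 0.5), np.full(n_per_class, 2.0)]
    radii = radii + rng.normal(0.0, 0.1, 2 * n_per_class)
    X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
    y = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = two_rings(50, rng)
X_test, y_test = two_rings(50, rng)

K_train = rbf_kernel(X_train, X_train)
alpha, y_mean = kpls_fit(K_train, y_train, n_components=2)

# Classify by the sign of the regression output.
train_pred = np.sign(kpls_predict(K_train, K_train, alpha, y_mean))
test_pred = np.sign(kpls_predict(rbf_kernel(X_test, X_train), K_train, alpha, y_mean))
train_acc = (train_pred == y_train).mean()
test_acc = (test_pred == y_test).mean()
print(train_acc, test_acc)
```

The toy labels stand in for spam and legitimate mail only loosely. In a real filter the rows of `X_train` would be feature vectors extracted from messages, and the kernel and number of components would need to be chosen by validation.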
Key words
computer application /
Chinese information processing /
spam filtering /
nonlinear /
kernel partial least squares /
regression /
classification /
latent semantic
References
[1] Zhan Chuan. Research on Anti-Spam Technology [D]. Doctoral dissertation, Department of Computer System Architecture, University of Electronic Science and Technology of China, 2005: 1-3.
[2] Konstantin Tretyakov. Machine Learning Techniques in Spam Filtering [C]//Data Mining Problem-oriented Seminar, MTAT.03.177, May 2004: 60-79.
[3] Wegelin J A. A Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-block Case [R]. Seattle: Department of Statistics, University of Washington, 2000: 21-28.
[4] Hoskuldsson A. PLS regression methods [J]. Journal of Chemometrics, 1988, 3 (2): 211-228.
[5] Peng-Ming Wang, Ming-Wen Wang, Guo-Bing Huang. Spam Filtering Based on PLS Feature Extraction [C]//NCIRCS-2007, Soochow University, 2007: 74-79.
[6] Scholkopf B, Smola A. Learning with Kernels [M]. Cambridge: MIT Press, 2002: 18-19.
[7] Shawe-Taylor J, Cristianini N. Kernel Methods for Pattern Analysis [M]. Beijing: China Machine Press, 2005: 60-74.
[8] Roman Rosipal, Leonard J. Trejo. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space [J]. Journal of Machine Learning Research, 2001, 2: 97-123.
[9] Suykens J A K. Nonlinear modelling and support vector machines [C]//Proceedings of the 18th IEEE Instrumentation and Measurement Technology Conference. Piscataway, NJ, USA: IEEE, 2001: 67-71.
[10] Suykens J A K, Vandewalle J. Least squares support vector machine classifiers [J]. Neural Processing Letters, 1999, 9 (3): 293-300.
[11] Ming-wen Wang, Jian-Yun Nie. A Latent Semantic Classification Model [C]//ACM 14th Conference on Information and Knowledge Management (CIKM), 31 October-5 November 2005, Bremen, Germany: 2-6.
[12] Roman Rosipal. Kernel Partial Least Squares for Nonlinear Regression and Discrimination [J]. Neural Network World, 2003, 13 (3): 2-8.
[13] Wen-bin Li, Chun-nian Liu, Yi-ying Chen. Design and Implement Cost-Sensitive Email Filtering Algorithms [C]//Proceedings of the Artificial Intelligence Applications and Innovations. Beijing (CN), 2005: 325-334.
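Funding
Supported by the National Natural Science Foundation of China (60663007); the Jiangxi Provincial Science and Technology Research Project (2006-184); and the Science and Technology Project of the Jiangxi Provincial Department of Education (2007-129).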