1. Key Laboratory of Intelligent Computing & Signal Processing, Ministry of Education, Anhui University, Hefei, Anhui 230039, China; 2. Anhui Shengzhi Radio and TV University, Hefei, Anhui 230001, China; 3. Hefei University of Technology, Hefei, Anhui 230009, China
Abstract: Spam filtering aims to distinguish spam from ham (legitimate email). Traditional methods use the vector space model together with feature selection to extract features that represent email content. However, these methods do not take the semantic relations among words into account. This paper proposes a new method that extracts email features by combining the vector space model with term co-occurrence. The covering algorithm is then employed to classify the emails. Experiments show that the proposed method significantly improves filtering performance compared with traditional methods, and that the features selected with the term co-occurrence model are more representative than those chosen by the vector space model.
Key words: computer application; Chinese information processing; vector space model; spam filter; term co-occurrence model; covering algorithm
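As a rough illustration of what combining the vector space model with term co-occurrence could look like, the following minimal sketch builds a term-frequency vector per email and appends binary features for term pairs that co-occur in several emails. The abstract does not specify the feature definition, weighting, or thresholds, so the function names, the document-level pair counting, the threshold of two emails, and the toy English tokens (the work itself targets Chinese text, which would first need word segmentation) are all assumptions made for illustration. The covering-algorithm classifier is not shown.

```python
from collections import Counter
from itertools import combinations


def term_frequency_vector(tokens, vocabulary):
    """Vector space model: one raw term-frequency weight per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cooccurrence_counts(documents):
    """For each unordered term pair, count the documents that contain both terms."""
    pair_counts = Counter()
    for tokens in documents:
        for pair in combinations(sorted(set(tokens)), 2):
            pair_counts[pair] += 1
    return pair_counts


def combined_features(tokens, vocabulary, pairs):
    """Concatenate term-frequency features with binary co-occurrence features."""
    present = set(tokens)
    pair_features = [1 if a in present and b in present else 0 for a, b in pairs]
    return term_frequency_vector(tokens, vocabulary) + pair_features


# Toy, made-up emails that are already tokenized (hypothetical data).
emails = [
    ["free", "prize", "click", "free"],
    ["meeting", "agenda", "monday"],
    ["free", "click", "offer"],
]

vocabulary = sorted({term for doc in emails for term in doc})

# Keep only pairs seen together in at least two emails (assumed threshold).
pairs = sorted(p for p, n in cooccurrence_counts(emails).items() if n >= 2)

features = [combined_features(doc, vocabulary, pairs) for doc in emails]
```

Each resulting vector has one entry per vocabulary term followed by one entry per retained term pair, so any classifier that accepts fixed-length numeric vectors could consume it.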