WU Yongliang, ZHAO Shuliang, LI Changjing, WEI Nadi, WANG Ziyan. Text Classification Method Based on TF-IDF and Cosine Similarity[J]. Journal of Chinese Information Processing, 2017, 31(5): 138-145.
Text Classification Method Based on TF-IDF and Cosine Similarity
WU Yongliang (1, 2), ZHAO Shuliang (1, 2), LI Changjing (1, 2), WEI Nadi (3), WANG Ziyan (4)
1. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang, Hebei 050024, China;
2. Hebei Key Laboratory of Computational Mathematics and Applications, Shijiazhuang, Hebei 050024, China;
3. Huihua College of Hebei Normal University, Shijiazhuang, Hebei 050091, China;
4. College of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230022, China
Abstract: Text classification is a fundamental task in text mining. Many text classification algorithms have been proposed in the literature, such as KNN, Naive Bayes, and Support Vector Machine, along with improved variants of them. The performance of these algorithms depends on the data set, and they lack a self-learning capability. This paper proposes an effective approach to text classification with three key points: 1) extracting the keywords of category (KWC) from labeled texts using TF-IDF; 2) classifying unlabeled text by its relevancy to each category; and 3) improving performance by updating the KWC during classification. Simulation experiments show that the new approach raises text classification accuracy to 90%, and up to 95% when the data volume is large enough. The method automatically updates the keywords of category, which improves the accuracy of the classifier.
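The three steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact KWC weighting, relevancy formula, or update rule, so everything concrete here is an assumption made for illustration: the whitespace tokenizer, the smoothed IDF computed over categories, the `top_k` cutoff, the similarity `threshold` that gates updates, and all function names.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens; Chinese text would need a word segmenter.
    return text.lower().split()


def category_keywords(labeled_docs, top_k=5):
    """Step 1: keep the top_k TF-IDF weighted terms of each category (the KWC)."""
    counts = defaultdict(Counter)
    for text, label in labeled_docs:
        counts[label].update(tokenize(text))
    n_cats = len(counts)
    # Number of categories in which each term occurs.
    cat_freq = Counter(term for c in counts.values() for term in c)
    kwc = {}
    for label, c in counts.items():
        total = sum(c.values())
        weights = {
            # TF inside the category times a smoothed IDF over categories (assumed form).
            t: (n / total) * (math.log((1 + n_cats) / (1 + cat_freq[t])) + 1)
            for t, n in c.items()
        }
        # Sort by descending weight, breaking ties alphabetically for determinism.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        kwc[label] = dict(top)
    return kwc


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(text, kwc):
    """Step 2: assign the category whose KWC vector is most similar to the text."""
    tf = Counter(tokenize(text))
    scores = {label: cosine(tf, keywords) for label, keywords in kwc.items()}
    return max(scores, key=scores.get), scores


def classify_and_update(texts, labeled_docs, top_k=5, threshold=0.3):
    """Step 3: fold confidently classified texts back in and rebuild the KWC."""
    docs = list(labeled_docs)
    kwc = category_keywords(docs, top_k)
    results = []
    for text in texts:
        label, scores = classify(text, kwc)
        results.append(label)
        if scores[label] >= threshold:  # hypothetical confidence gate
            docs.append((text, label))
            kwc = category_keywords(docs, top_k)
    return results, kwc


if __name__ == "__main__":
    train = [
        ("football match goal team", "sports"),
        ("team wins the match", "sports"),
        ("new phone chip released", "tech"),
        ("chip maker releases software", "tech"),
    ]
    labels, kwc = classify_and_update(
        ["the team scored a goal", "phone software update"], train
    )
    print(labels)
```

The sketch omits stop-word filtering, so function words such as "the" can enter a category's keyword set. Rebuilding the KWC from scratch after every accepted text keeps the code short; an incremental update of the counts would be the natural choice on larger collections.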