Abstract
Webpage classification can be regarded as text classification in a noisy environment. This paper explores text classification methods under such conditions. We apply three classifiers that perform comparably on traditional text classification to webpage classification: a Bayes classifier based on the N-gram model (NGBayes), a word-segmentation-based Naive Bayes classifier (NBayes), and a word-segmentation-based k-Nearest Neighbor classifier (kNN). Experiments are conducted on two corpora: CCT2002-v1.1 (Corp_1), the Chinese webpage classification training set provided by the Chinese Web Information Retrieval Forum, and a Chinese webpage corpus (Corp_2) that we collected ourselves. The experiments confirm that the three classifiers perform comparably in a noise-free environment. In a noisy environment, however, NGBayes greatly outperforms NBayes and kNN, which indicates that NGBayes is insensitive to the noise in Chinese webpages. An analysis of the features then explores why NGBayes resists noise. We conclude that NGBayes is a noise-resistant method for Chinese webpage classification.
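To make the idea of classifying Chinese text without word segmentation concrete, the following is a minimal sketch, not the NGBayes model evaluated in the paper. It assumes a plain multinomial Naive Bayes over overlapping character bigrams with add-one smoothing; the class name `NgramBayes`, the helper `char_ngrams`, and the toy training strings are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams; no word segmentation is needed."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramBayes:
    """Illustrative multinomial Naive Bayes over character n-grams.

    Assumption: add-one (Laplace) smoothing, which is a simplification and
    not necessarily the smoothing used by the paper's NGBayes.
    """

    def __init__(self, n=2):
        self.n = n
        self.class_docs = Counter()               # documents seen per class
        self.gram_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_docs[label] += 1
            self.gram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_docs.items():
            counts = self.gram_counts[label]
            total = sum(counts.values())
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, invented for this sketch.
texts = ["足球比赛今晚开始", "篮球联赛球队夺冠", "股票市场今日上涨", "银行利率基金投资"]
labels = ["sports", "sports", "finance", "finance"]
clf = NgramBayes(n=2).fit(texts, labels)

# A page-like string with navigation and copyright text around the content.
print(clf.predict("首页 登录 足球联赛 版权所有"))
```

In this toy setting the bigrams from the surrounding page text were never seen in training, so they receive the same smoothed probability under both classes and the decision rests on the content bigrams. This illustrates the mechanics only and makes no claim about the experimental results reported in the paper.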
Key words
computer application /
Chinese information processing /
N-gram model /
NBayes /
kNN
Funding
National 973 Program of China (2004CB318109); Beijing Science and Technology Program (D0106008040291)