HUANG Ke, MA Shao-ping. Chinese Web Page Classification Based on Statistical Word Segmentation [J]. Journal of Chinese Information Processing, 2002, 16(6): 26-32.
Chinese Web Page Classification Based on Statistical Word Segmentation
HUANG Ke, MA Shao-ping
State Key Lab of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University
Abstract: Word segmentation is an important step in Chinese natural language processing. This paper explores the problem of classifying Chinese web pages based on statistical word segmentation. We first automatically construct a list of two-character Chinese words from the training web pages. The text of the test web pages is then segmented with this word list, and the pages are classified based on the segmentation results. Experiments show that statistical word segmentation can effectively improve classification precision. Based on the experimental results, we analyze the influence of statistical word segmentation on Chinese web page classification. Single Chinese characters and words play different roles in web page classification, and we also analyze the reason for this difference.
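The first two steps described in the abstract (building a two-character word list from training text, then segmenting new text with it) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which statistic selects the words or how matching is done, so the raw frequency threshold `min_count`, the greedy left-to-right matching, and the function names `build_bigram_wordlist` and `segment` are all hypothetical choices, not the authors' method.

```python
from collections import Counter


def build_bigram_wordlist(training_texts, min_count=2):
    """Collect adjacent character pairs seen at least `min_count` times.

    Assumption: a plain frequency threshold stands in for whatever
    statistic the paper actually uses to select two-character words.
    """
    counts = Counter()
    for text in training_texts:
        for i in range(len(text) - 1):
            counts[text[i:i + 2]] += 1
    return {bigram for bigram, n in counts.items() if n >= min_count}


def segment(text, wordlist):
    """Greedy left-to-right segmentation (an assumed matching strategy).

    Emits a two-character word when the next pair is in the word list,
    otherwise emits a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in wordlist:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens


# Toy training "pages" (made-up strings, not data from the paper).
training = ["中文网页", "中文分类", "网页分类"]
wordlist = build_bigram_wordlist(training, min_count=2)
tokens = segment("中文网页的分类", wordlist)
print(tokens)
```

The resulting token list mixes two-character words and single characters, which is the kind of mixed feature set the abstract refers to when it says characters and words play different roles in classification. Token counts from `segment` could then be fed to any standard text classifier.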