An Approach of Multi-hierarchy Text Classification Based on Vector Space Model

LIU Shao-hui, DONG Ming-kai, ZHANG Hai-jun, LI Rong, SHI Zhong-zhi

Journal of Chinese Information Processing (中文信息学报), 2002, Vol. 16, Issue 3: 9-15, 27.

Abstract

This paper studies and improves the classical term-weighting method of the Vector Space Model (VSM) and, on that basis, proposes a multi-hierarchy text classification approach based on VSM. The classes are organized into a tree according to given hierarchical relations, and all training documents in a class are merged into a single class-document. When the class models are extracted, comparisons are made only among class-documents under the same node at the same level. To classify a document, matching starts at the root node to find the corresponding top-level class and then proceeds recursively downward until the corresponding leaf subclass is found. Experiments and deployed systems indicate that the approach achieves high precision and recall.
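
The top-down procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy hierarchy `TREE`, the function names, and the plain TF-IDF weighting (standing in for the paper's improved term-weighting method, which the abstract does not spell out) are all assumptions.

```python
import math
from collections import Counter

# Hypothetical toy hierarchy: each leaf class holds tokenized training documents.
TREE = {
    "sports": {
        "football": [["goal", "striker", "league"], ["goal", "penalty", "match"]],
        "tennis": [["serve", "racket", "match"], ["serve", "set", "court"]],
    },
    "finance": {
        "stocks": [["share", "market", "index"], ["share", "dividend", "market"]],
        "banking": [["loan", "interest", "deposit"], ["loan", "credit", "branch"]],
    },
}


def class_document(node):
    """Merge every training document under a node into one bag of words."""
    if isinstance(node, list):  # leaf: a list of tokenized documents
        return Counter(tok for doc in node for tok in doc)
    merged = Counter()
    for child in node.values():
        merged += class_document(child)
    return merged


def sibling_models(children):
    """Weight terms by comparing only class-documents under the same node."""
    bags = {name: class_document(child) for name, child in children.items()}
    n = len(bags)
    df = Counter(term for bag in bags.values() for term in bag)
    # Assumed weighting: tf * (ln(n / df) + 1); the +1 keeps terms shared by
    # all siblings from vanishing. The paper's own formula may differ.
    return {
        name: {t: tf * (math.log(n / df[t]) + 1.0) for t, tf in bag.items()}
        for name, bag in bags.items()
    }


def cosine(query, model):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * model.get(t, 0.0) for t, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_m = math.sqrt(sum(w * w for w in model.values()))
    return dot / (norm_q * norm_m) if norm_q and norm_m else 0.0


def classify(tokens, node=TREE):
    """Descend from the root, choosing the best-matching child at each level."""
    query = Counter(tokens)  # raw term frequencies of the document
    path = []
    while isinstance(node, dict):
        models = sibling_models(node)
        best = max(models, key=lambda name: cosine(query, models[name]))
        path.append(best)
        node = node[best]
    return path


print(classify(["loan", "interest", "market"]))  # ['finance', 'banking']
```

Because each level only compares sibling class-documents, a term such as "market" that is uninformative at the root can still separate "stocks" from "banking" one level down. A practical system would precompute the sibling models once rather than rebuilding them on every call.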

Key words

Text Classification / Vector Space Model / Information Gain / Feature Selection

Cite this article

LIU Shao-hui, DONG Ming-kai, ZHANG Hai-jun, LI Rong, SHI Zhong-zhi. An Approach of Multi-hierarchy Text Classification Based on Vector Space Model. Journal of Chinese Information Processing. 2002, 16(3): 9-15, 27

Funding

National Natural Science Foundation of China (60173017); Beijing Natural Science Foundation (4011003)