Abstract
Existing statistical document classification systems usually ignore the hierarchical structure of the predefined topic categories, which makes it difficult to decide which category a document belongs to when the candidate categories are similar. Building on the vector space model, this paper proposes a method that classifies documents top-down, level by level, according to the hierarchical structure of the categories, with the aim of improving classification precision. Using a concept dictionary (thesaurus), synonyms and lower-level concepts in a document are mapped to a single concept word; these concept words form a small feature set, which reduces the dimension of the feature vector space and hence the computation the classification system requires. In addition, the feature vectors are compressed through an analysis of the category hierarchy, which reduces the computation from another angle.
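The two ideas in the abstract, mapping words to concept words and descending the category hierarchy one level at a time, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the paper: the toy concept dictionary `CONCEPT_DICT`, the category tree `CHILDREN`, the hand-written prototype vectors in `PROTOTYPES`, and the use of cosine similarity with raw concept counts as weights. The paper's own weighting scheme, dictionary, and feature-compression step may differ.

```python
import math
from collections import Counter

# Hypothetical concept dictionary: surface word -> concept word.
# Synonyms collapse onto one concept, which keeps the feature set small.
CONCEPT_DICT = {
    "car": "car", "automobile": "car", "sedan": "car",
    "truck": "truck", "lorry": "truck",
    "engine": "engine", "motor": "engine",
    "soccer": "soccer", "football": "soccer",
    "racket": "racket", "racquet": "racket",
    "match": "match", "game": "match",
}

# Hypothetical category hierarchy: parent -> list of child categories.
CHILDREN = {
    "root": ["transport", "sports"],
    "transport": ["cars", "trucks"],
    "sports": ["soccer", "tennis"],
}

# Hypothetical prototype vectors (concept word -> weight) for each category.
PROTOTYPES = {
    "transport": {"car": 1.0, "truck": 1.0, "engine": 1.0},
    "sports": {"soccer": 1.0, "racket": 1.0, "match": 1.0},
    "cars": {"car": 2.0, "engine": 1.0},
    "trucks": {"truck": 2.0, "engine": 1.0},
    "soccer": {"soccer": 2.0, "match": 1.0},
    "tennis": {"racket": 2.0, "match": 1.0},
}


def to_concept_vector(tokens):
    """Count concept words; tokens missing from the dictionary are dropped."""
    return Counter(CONCEPT_DICT[t] for t in tokens if t in CONCEPT_DICT)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_top_down(tokens, root="root"):
    """Descend the hierarchy, picking the most similar child at each level."""
    doc = to_concept_vector(tokens)
    path, node = [], root
    while node in CHILDREN:
        # Only the children of the current node compete at this level.
        node = max(CHILDREN[node], key=lambda c: cosine(doc, PROTOTYPES[c]))
        path.append(node)
    return path


print(classify_top_down(["the", "lorry", "has", "a", "big", "motor"]))
# ['transport', 'trucks']
```

In this sketch a document is compared only against the siblings at the current level, so "cars" and "trucks" compete with each other but never with "tennis". This is one plausible reading of how level-by-level classification helps separate closely related categories.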
Keywords
Document classification /
Vector space model /
Topic category hierarchy