Application of Rule-Based Automatic Classification in Text Classification

LI Yu-qin, SUN Li-hua

Journal of Chinese Information Processing, 2004, Vol. 18, Issue 4: 10-15.



Abstract

Automatic text classification is the technique of assigning a text to one or more categories according to a given strategy. This paper first introduces three statistics-based automatic classification techniques (the k-nearest-neighbor classifier, the support vector machine classifier, and the naive Bayes classifier) and analyzes the strengths and weaknesses of statistics-based classification. Its main weakness is that classification precision tends to fall as the overlap between the classification features of different categories grows, and this limitation is especially pronounced in multi-level classification. To address this limitation and improve precision, we introduce rule-based automatic classification to improve and extend the statistical approach, combine the strengths of the two techniques, and design a hybrid classifier system, which achieves fairly satisfactory classification results.
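The abstract does not say how the rule-based and statistical components are combined, so the following is only a minimal sketch of one possible arrangement, assuming that hand-written keyword rules are checked first and a naive Bayes classifier is used as the fallback. The training texts, the `RULES` table, and the names `NaiveBayes` and `hybrid_classify` are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical rule table: if every keyword occurs in the text,
# the rule assigns the category directly.
RULES = [
    ({"interest", "rate"}, "finance"),
]


def hybrid_classify(doc, model, rules=RULES):
    """Return (category, source): try the rules first, then the statistical model."""
    tokens = set(doc.split())
    for keywords, category in rules:
        if keywords <= tokens:
            return category, "rule"
    return model.predict(doc), "statistical"


# Toy training set (assumed for illustration); "rate" and "match" make the
# two categories share features, which is the overlap the abstract describes.
train_docs = [
    "stock market shares fall",
    "bank raises interest rate",
    "team wins match goal",
    "player injury rate rises match",
]
train_labels = ["finance", "finance", "sports", "sports"]
model = NaiveBayes().fit(train_docs, train_labels)
```

On this toy data, `model.predict("player interest rate match")` chooses `sports` because the shared words pull the statistical score toward that class, whereas `hybrid_classify` returns `("finance", "rule")` for the same text. Putting the rules first is just one design choice; a real system might instead apply rules only where the statistical classifier is uncertain.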

Key words

computer application / Chinese information processing / text mining / text classification / rule-based classification

Cite this article

LI Yu-qin, SUN Li-hua. Rule-based Automatic Category Application on Text Category. Journal of Chinese Information Processing. 2004, 18(4): 10-15
