Study on Feature Selection in Chinese Text Categorization

ZHOU Qian, ZHAO Ming-sheng, HU Min

Journal of Chinese Information Processing, 2004, Vol. 18, Issue 3: 18-24.

Abstract

This paper introduces and compares eight feature selection methods for text categorization. Among them, Odds Ratio, which is commonly used with binary classifiers, is adapted into a form suitable for multi-class problems (Multi-Class Odds Ratio, MC-OR), and a new feature selection method based on Class-Discriminating Words (CDW) is proposed. The methods are combined with two classifiers, the classic VSM classifier based on cosine similarity and the Naïve Bayes classifier, and training and testing are carried out on two data sets with different class distributions. The results indicate that, of the eight methods, MC-OR and CDW achieve the best selection performance. In particular, when the Naïve Bayes classifier is trained and tested on the 13890-sample set, whose class distribution is severely imbalanced, and the feature dimension exceeds 8000, the macro-F1 obtained with CDW feature selection is about 3% to 5% higher than that obtained with information gain (IG).
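
For orientation, the IG baseline named in the abstract can be sketched as follows. This is the standard information-gain formulation over term presence and absence, not code from the paper; the function names `information_gain` and `select_top_k`, and the representation of each document as a set of terms, are illustrative assumptions. The abstract does not give the MC-OR or CDW formulas, so they are not reproduced here.

```python
import math
from collections import Counter


def information_gain(docs, labels, term):
    """Information gain of a term's presence/absence w.r.t. the class label.

    docs   : list of sets of terms, one set per document (assumed format)
    labels : list of class labels, same length as docs
    """
    n = len(docs)

    def entropy(counts, total):
        # Shannon entropy in bits of a label distribution given raw counts.
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    class_counts = Counter(labels)
    with_term = Counter(y for d, y in zip(docs, labels) if term in d)
    without_term = Counter(y for d, y in zip(docs, labels) if term not in d)
    n_with = sum(with_term.values())
    n_without = n - n_with

    # H(C) minus the expected entropy after observing whether the term occurs.
    h_prior = entropy(class_counts.values(), n)
    h_cond = 0.0
    if n_with:
        h_cond += (n_with / n) * entropy(with_term.values(), n_with)
    if n_without:
        h_cond += (n_without / n) * entropy(without_term.values(), n_without)
    return h_prior - h_cond


def select_top_k(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]
```

A feature selector of this kind scores every vocabulary term once on the training set and keeps the top-scoring terms as the feature space; the feature dimension quoted in the abstract corresponds to the parameter `k` in this sketch.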

Key words

computer application / Chinese information processing / text categorization / feature selection / class-discriminating words

Cite this article

ZHOU Qian, ZHAO Ming-sheng, HU Min. Study on Feature Selection in Chinese Text Categorization. Journal of Chinese Information Processing. 2004, 18(3): 18-24

Funding

Supported by the National Natural Science Foundation of China (60003014; 60171037)