Abstract
This paper introduces and compares eight feature selection methods for text categorization. Two of them are proposed here: Multi-Class Odds Ratio (MC-OR), which adapts the Odds Ratio commonly used with binary classifiers to multi-class problems, and a new feature selection method based on Class-Discriminating Words (CDW). The methods are combined with two classifiers, the classic VSM classifier based on cosine similarity and the Naïve Bayes classifier, and training and testing are carried out on two text sets with different class distributions. The results indicate that, among the eight methods, MC-OR and CDW give the best selection performance. In particular, when the Naïve Bayes classifier is trained and tested on the 13,890-sample set, whose class distribution is severely unbalanced, and the feature dimension exceeds 8,000, the macro-F1 obtained with CDW is roughly 3% to 5% higher than the macro-F1 obtained with information gain (IG).
Key words
computer application /
Chinese information processing /
text categorization /
feature selection /
class-discriminating words
Funding
Supported by the National Natural Science Foundation of China (60003014; 60171037)