A Comparative Study on Feature Selection in Chinese Text Categorization

DAI Liu-ling, HUANG He-yan, CHEN Zhao-xiong

Journal of Chinese Information Processing ›› 2004, Vol. 18 ›› Issue (1): 27-33.


Abstract

This paper presents a comparative study of how feature selection methods affect classification performance in Chinese text categorization. Four feature selection methods are examined: document frequency (DF), information gain (IG), mutual information (MI), and the χ² statistic (CHI). A support vector machine (SVM) and a k-nearest neighbor (KNN) classifier are used to assess the effectiveness of each method. The experimental results show that the methods that perform well in English text categorization (IG, MI, and CHI) are not suitable for Chinese text categorization without modification. We analyze the causes of this difference theoretically and discuss possible remedies, including the use of very large training corpora and combined feature selection methods. A further experiment verifies the effectiveness of the combined feature selection method.
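
The four scores named in the abstract can be illustrated with a minimal sketch. This page does not reproduce the paper's formulas, so the definitions below are the common textbook ones for a single term and a single category, computed from a 2x2 table of document counts; the function name `feature_scores` and the argument names `a`, `b`, `c`, `d` are hypothetical and chosen only for this illustration.

```python
import math


def feature_scores(a, b, c, d):
    """Score one term against one category from a 2x2 document-count table.

    Assumed layout (illustrative, not taken from the paper):
      a: documents in the category that contain the term
      b: documents outside the category that contain the term
      c: documents in the category that lack the term
      d: documents outside the category that lack the term
    """
    n = a + b + c + d
    df = a + b  # document frequency: number of documents containing the term

    # Pointwise mutual information: log( P(t, c) / (P(t) * P(c)) ).
    mi = math.log((a * n) / ((a + b) * (a + c))) if a > 0 else float("-inf")

    # Chi-square statistic for the 2x2 table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi = n * (a * d - c * b) ** 2 / denom if denom > 0 else 0.0

    def entropy(*counts):
        # Shannon entropy (natural log) of a distribution given as raw counts.
        total = sum(counts)
        if total == 0:
            return 0.0
        return -sum(k / total * math.log(k / total) for k in counts if k > 0)

    # Information gain: H(C) - H(C | term present or absent), two-class case.
    h_c = entropy(a + c, b + d)
    h_c_given_t = ((a + b) / n) * entropy(a, b) + ((c + d) / n) * entropy(c, d)
    ig = h_c - h_c_given_t

    return {"DF": df, "IG": ig, "MI": mi, "CHI": chi}


# A frequent, fairly discriminative term versus a term seen in a single document.
common = feature_scores(40, 10, 10, 40)
rare = feature_scores(1, 0, 49, 50)
print(common)
print(rare)
```

Under these assumed definitions, the term that occurs in only one document receives a higher MI score than the frequent term, while its DF, IG, and CHI scores are lower. This sensitivity of MI to rare terms is a generally known property of the measure; whether it matches the paper's own theoretical analysis cannot be confirmed from this page.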

Key words

computer application / Chinese information processing / text categorization / feature selection / SVM / KNN

Cite this article

DAI Liu-ling, HUANG He-yan, CHEN Zhao-xiong. A Comparative Study on Feature Selection in Chinese Text Categorization. Journal of Chinese Information Processing. 2004, 18(1): 27-33


Funding

Supported by the National Natural Science Foundation of China (60272088)