An Improved KNN Algorithm Applied to Text Categorization

WANG Yu 1,2, WANG Zheng-ou 2, BAI Shi 3

Journal of Chinese Information Processing ›› 2007, Vol. 21 ›› Issue (3): 76-82.

Review

Abstract

The nearest neighbor classifier assumes that class conditional probabilities are locally constant, an assumption that breaks down in high-dimensional feature spaces. Applying a KNN classifier in such a space without correcting the feature weights therefore introduces severe bias. In this paper, initial feature weights are obtained by a sensitivity method based on a feedforward neural network, which also performs a second round of dimensionality reduction. Under the initial weights, the training samples are partitioned by sample similarity into small regions with an SS-tree, which is used to find k0 approximate nearest neighbors of the sample to be classified. New feature weights are then computed from these k0 approximate neighbors using the Chi-square distance principle, and the final k nearest neighbors are searched under the new weights. At a modest additional time cost, the method yields a clear improvement in text categorization accuracy.
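The two-stage neighbor search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the uniform initial weights (the paper derives them from a neural network sensitivity analysis), the brute-force pre-search standing in for the SS-tree index, and the simple variance-ratio weighting standing in for the Chi-square distance criterion are all assumptions.

```python
import numpy as np

def weighted_knn_two_stage(X_train, y_train, x, k=5, k0=20):
    """Two-stage weighted KNN: find k0 approximate neighbors under the
    initial weights, re-estimate per-feature weights from them, then
    classify by the k nearest neighbors under the new weights."""
    n_features = X_train.shape[1]
    w0 = np.ones(n_features)  # uniform initial weights (an assumption)

    # Stage 1: k0 approximate nearest neighbors under the initial weights.
    # A brute-force scan stands in for the paper's SS-tree region lookup.
    d0 = ((X_train - x) ** 2 * w0).sum(axis=1)
    idx0 = np.argsort(d0)[:k0]

    # Stage 2: re-weight features by how well they separate the classes
    # among the k0 neighbors (a variance-ratio stand-in for the paper's
    # Chi-square distance criterion): features whose within-class spread
    # is small relative to their overall spread get larger weight.
    neigh, labels = X_train[idx0], y_train[idx0]
    overall_var = neigh.var(axis=0) + 1e-12
    within_var = np.zeros(n_features)
    for c in np.unique(labels):
        mask = labels == c
        within_var += mask.mean() * neigh[mask].var(axis=0)
    w1 = overall_var / (within_var + 1e-12)
    w1 /= w1.sum()

    # Final k nearest neighbors under the new weights; majority vote.
    d1 = ((X_train - x) ** 2 * w1).sum(axis=1)
    idx1 = np.argsort(d1)[:k]
    values, counts = np.unique(y_train[idx1], return_counts=True)
    return values[np.argmax(counts)]
```

The point of the second stage is visible when one feature is discriminative and another is noise: the re-estimated weights suppress the noisy dimension, so the final k neighbors are chosen along the informative one.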

Key words

computer application / Chinese information processing / text categorization / neural network / Chi-square distance / KNN algorithm

Cite this article

WANG Yu, WANG Zheng-ou, BAI Shi. An Improved KNN Algorithm Applied to Text Categorization. Journal of Chinese Information Processing. 2007, 21(3): 76-82

Funding

This work was supported by the National Natural Science Foundation of China (60573097), the National Science and Technology Program (2004BA721A02), the Natural Science Foundation of Guangdong Province (05200302, 06104916), the Science and Technology Program of Guangdong Province (2005B10101032), the Specialized Research Fund for the Doctoral Program of Higher Education (20050558017), and the Natural Science Youth Foundation and Student Research Program of South China University of Technology.