An Improved KNN Algorithm Applied to Text Categorization

WANG Yu 1,2, WANG Zheng-ou 2, BAI Shi 3

Journal of Chinese Information Processing ›› 2007, Vol. 21 ›› Issue (3): 76-82.

Review

Abstract

The nearest neighbor classifier assumes that class conditional probabilities are locally constant, an assumption that breaks down in high-dimensional feature spaces. Applying a KNN classifier in such a space without correcting the feature weights therefore introduces severe bias. In this paper, initial feature weights are obtained by a sensitivity method based on a feedforward neural network, which also performs a second round of dimensionality reduction. Under the initial weights, the training samples are partitioned by sample similarity into small regions with an SS-tree, which is used to find k0 approximate nearest neighbors of the sample to be classified. New feature weights are then computed from these k0 approximate neighbors using the Chi-square distance principle, and the final k nearest neighbors are searched under the new weights. At a modest additional time cost, the method yields a clear improvement in text categorization accuracy.
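The two-stage neighbor search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the uniform initial weights (the paper derives them from a neural network sensitivity analysis), the brute-force pre-search standing in for the SS-tree index, and the simple variance-ratio weighting standing in for the Chi-square distance criterion are all assumptions.

```python
import numpy as np

def weighted_knn_two_stage(X_train, y_train, x, k=5, k0=20):
    """Two-stage weighted KNN: find k0 approximate neighbors under the
    initial weights, re-estimate per-feature weights from them, then
    classify by the k nearest neighbors under the new weights."""
    n_features = X_train.shape[1]
    w0 = np.ones(n_features)  # uniform initial weights (an assumption)

    # Stage 1: k0 approximate nearest neighbors under the initial weights.
    # A brute-force scan stands in for the paper's SS-tree region lookup.
    d0 = ((X_train - x) ** 2 * w0).sum(axis=1)
    idx0 = np.argsort(d0)[:k0]

    # Stage 2: re-weight features by how well they separate the classes
    # among the k0 neighbors (a variance-ratio stand-in for the paper's
    # Chi-square distance criterion): features whose within-class spread
    # is small relative to their overall spread get larger weight.
    neigh, labels = X_train[idx0], y_train[idx0]
    overall_var = neigh.var(axis=0) + 1e-12
    within_var = np.zeros(n_features)
    for c in np.unique(labels):
        mask = labels == c
        within_var += mask.mean() * neigh[mask].var(axis=0)
    w1 = overall_var / (within_var + 1e-12)
    w1 /= w1.sum()

    # Final k nearest neighbors under the new weights; majority vote.
    d1 = ((X_train - x) ** 2 * w1).sum(axis=1)
    idx1 = np.argsort(d1)[:k]
    values, counts = np.unique(y_train[idx1], return_counts=True)
    return values[np.argmax(counts)]
```

The point of the second stage is visible when one feature is discriminative and another is noise: the re-estimated weights suppress the noisy dimension, so the final k neighbors are chosen along the informative one.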

Key words

computer application / Chinese information processing / text categorization / neural network / Chi-square distance / KNN algorithm

Cite this article

WANG Yu, WANG Zheng-ou, BAI Shi. An Improved KNN Algorithm Applied to Text Categorization. Journal of Chinese Information Processing. 2007, 21(3): 76-82

Funding

This work was supported by the National Natural Science Foundation of China (60573097), the National Science and Technology Program (2004BA721A02), the Natural Science Foundation of Guangdong Province (05200302, 06104916), the Science and Technology Program of Guangdong Province (2005B10101032), the Specialized Research Fund for the Doctoral Program of Higher Education (20050558017), and the Natural Science Youth Foundation and Student Research Program of South China University of Technology.