一种基于空间映射及尺度变换的聚类框架

PDF(906 KB)

中文信息学报 ›› 2010, Vol. 24 ›› Issue (3) : 81-89.

综述

一种基于空间映射及尺度变换的聚类框架

曾依灵¹,许洪波¹,吴高巍¹,程学旗¹,白硕^1,2

作者信息 +

A Mapping and Rescaling Framework for Document Clustering

ZENG Yiling¹, XU Hongbo¹, WU Gaowei¹, CHENG Xueqi¹, BAI Shuo^1,2

Author information +

History +

摘要

传统聚类算法通常建立在显式的模型之上,很少考虑泛化模型以适应不同的数据,由此导致了模型不匹配问题。针对此问题,该文提出了一种基于空间映射(Mapping)及尺度变换(Rescaling)的聚类框架(简称M-R框架)。具体而言,M-R框架首先将语料映射到一组具有良好区分度的方向所构建的坐标系中,以统计各个簇的分布特性,然后根据这些分布特性对各个坐标轴进行尺度变换,以归一化语料中各个类簇的分布。如上两步操作伴随算法迭代执行,直至算法收敛。该文将M-R框架应用到K-means算法及谱聚类算法上以验证其性能,在国际标准评测语料上的实验表明,应用了M-R框架的K-means及谱聚类在所有语料集上获得了全面的性能提升。

Abstract

Traditional clustering algorithms suffer from model mismatch problem when the distribution of real data does not fit the model assumptions. To address this problem, a mapping and rescaling framework (referred as M-R framework) is proposed for document clustering. Specifically, documents are first mapped into a discriminative coordinate so that the distribution statistics of each cluster could be analyzed on the corresponding dimension. With the statistics obtained, a rescaling operation is then applied to normalize the data distribution based on the model assumptions. These two steps are conducted iteratively along with the clustering algorithm to improve the clustering performance. In the experiment, the M-R framework is applied on traditional k-means and the state-of-art spectral clustering algorithm Ncut. Resultss on well known datasets show that M-R framework brings performance improvements in all datasets.
Key wordscomputer application; Chinese information processing; document clustering; space mapping; rescaling; model misfit

导出引用

曾依灵1,许洪波1,吴高巍1,程学旗1,白硕1,2. 一种基于空间映射及尺度变换的聚类框架. 中文信息学报. 2010, 24(3): 81-89

ZENG Yiling1, XU Hongbo1, WU Gaowei1, CHENG Xueqi1, BAI Shuo1,2. A Mapping and Rescaling Framework for Document Clustering. Journal of Chinese Information Processing. 2010, 24(3): 81-89

参考文献

[1] Dumais S. T. LSI Meets TREC: A Status Report[C]// D. Harman (Ed.) Prof. of The First Text REtrieval Conference (TREC1), National Institute of Standards and Technology Special Publication 500-207, 1993: 137-152.
[2] Liu X., Croft W.B. Cluster-Based Retrieval Using Language Models[C]// Proc. of SIGIR, 2004: 186-193.
[3] Zamir O., Etzioni O., Madani O., et al. Fast and Intuitive Clustering of Web Documents[C]// Proc. of KDD, 1997: 287-290.
[4] Han J. and Kamber M. Data Mining: Concepts and Techniques, Second Edition[M]. Morgan Kaufmann Publishes, 2006.
[5] Wu H., Phang T. H., Liu B., et al. A Refinement Approach to Handling Model Misfit in Text Categorization[C]// SIGKDD, 2002: 207-216.
[6] Tan S., Cheng X., Ghanem MM, et al. A Novel Refinement Approach for Text Categorization[C]// Proc. of the 14th ACM CIKM, 2005: 469-476.
[7] Shawe-Taylor J., Cristianini N. Kernel Methods for Pattern Analysis[M]. Cambridge University Press, 2004.
[8] Ng A., Jordan M., Weiss Y. On Spectral Clustering: Analysis and an Algorithm[J]. T. Dietterich, S. Becker, and Ghahramani Z. (Eds.), Advances in Neural Information Processing Systems 14, MIT Press, 2002.
[9] Shi, J. and Malik, J. Normalized Cuts and Image Segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905.
[10] Chan P. K., Schlag D. F., Zien J. Y. Spectral K-way Ratio-Cut Partitioning and Clustering[J]. IEEE Trans. Computer-Aided Design, 1994, 13:1088-1096.
[11] Ding C., He X., Zha H., et al. A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering[C]// Proc. of ICDM, 2001: 107-114.
[12] Zelnik-Manor L., Perona P. Self-Tuning Spectral Clustering[C]// NIPS 17. 2005.
[13] Duda R. O., Hart P. E., Stork D. G. Pattern Classification, Second Edition[M].Wiley-Interscience Publishes, 2000.
[14] Lewis D. D., Yang Y., Rose T., et al. RCV1: A New Benchmark Collection for Text Categorization Research[J].Journal of Machine Learning Research, 2004.
[15] 周昭涛. 文本聚类分析效果评价及文本表示研究[D]. 北京: 中国科学院计算技术研究所, 2005.

基金

国家973基础研究计划项目资助(2007CB311100);国家自然科学基金重点项目资助(60933005)

PDF(906 KB)

Accesses

Citation

Detail

段落导航

摘要
Abstract
关键词
Key words
引用本文
参考文献
基金

Received	Published
2009-09-08	2010-06-15
Issue Date
2010-06-15

选择文件类型/文献管理软件名称

选择包含的内容

摘要

Abstract

关键词

Key words

引用本文

{{custom_sec.title}}

{{custom_sec.title}}

参考文献

{{custom_fnGroup.title_cn}}

脚注

基金