传统聚类算法通常建立在显式的模型之上,很少考虑泛化模型以适应不同的数据,由此导致了模型不匹配问题。针对此问题,该文提出了一种基于空间映射(Mapping)及尺度变换(Rescaling)的聚类框架(简称M-R框架)。具体而言,M-R框架首先将语料映射到一组具有良好区分度的方向所构建的坐标系中,以统计各个簇的分布特性,然后根据这些分布特性对各个坐标轴进行尺度变换,以归一化语料中各个类簇的分布。如上两步操作伴随算法迭代执行,直至算法收敛。该文将M-R框架应用到K-means算法及谱聚类算法上以验证其性能,在国际标准评测语料上的实验表明,应用了M-R框架的K-means及谱聚类在所有语料集上获得了全面的性能提升。
Abstract
Traditional clustering algorithms suffer from model mismatch problem when the distribution of real data does not fit the model assumptions. To address this problem, a mapping and rescaling framework (referred as M-R framework) is proposed for document clustering. Specifically, documents are first mapped into a discriminative coordinate so that the distribution statistics of each cluster could be analyzed on the corresponding dimension. With the statistics obtained, a rescaling operation is then applied to normalize the data distribution based on the model assumptions. These two steps are conducted iteratively along with the clustering algorithm to improve the clustering performance. In the experiment, the M-R framework is applied on traditional k-means and the state-of-art spectral clustering algorithm Ncut. Resultss on well known datasets show that M-R framework brings performance improvements in all datasets.
Key wordscomputer application; Chinese information processing; document clustering; space mapping; rescaling; model misfit
关键词
计算机应用 /
中文信息处理 /
文本聚类 /
空间映射 /
尺度变换 /
模型不匹配
{{custom_keyword}} /
Key words
computer application /
Chinese information processing /
document clustering /
space mapping /
rescaling /
model misfit
/
/
/
/
/
/
/
/
{{custom_keyword}} /
{{custom_sec.title}}
{{custom_sec.title}}
{{custom_sec.content}}
参考文献
[1] Dumais S. T. LSI Meets TREC: A Status Report[C]// D. Harman (Ed.) Prof. of The First Text REtrieval Conference (TREC1), National Institute of Standards and Technology Special Publication 500-207, 1993: 137-152.
[2] Liu X., Croft W.B. Cluster-Based Retrieval Using Language Models[C]// Proc. of SIGIR, 2004: 186-193.
[3] Zamir O., Etzioni O., Madani O., et al. Fast and Intuitive Clustering of Web Documents[C]// Proc. of KDD, 1997: 287-290.
[4] Han J. and Kamber M. Data Mining: Concepts and Techniques, Second Edition[M]. Morgan Kaufmann Publishes, 2006.
[5] Wu H., Phang T. H., Liu B., et al. A Refinement Approach to Handling Model Misfit in Text Categorization[C]// SIGKDD, 2002: 207-216.
[6] Tan S., Cheng X., Ghanem MM, et al. A Novel Refinement Approach for Text Categorization[C]// Proc. of the 14th ACM CIKM, 2005: 469-476.
[7] Shawe-Taylor J., Cristianini N. Kernel Methods for Pattern Analysis[M]. Cambridge University Press, 2004.
[8] Ng A., Jordan M., Weiss Y. On Spectral Clustering: Analysis and an Algorithm[J]. T. Dietterich, S. Becker, and Ghahramani Z. (Eds.), Advances in Neural Information Processing Systems 14, MIT Press, 2002.
[9] Shi, J. and Malik, J. Normalized Cuts and Image Segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905.
[10] Chan P. K., Schlag D. F., Zien J. Y. Spectral K-way Ratio-Cut Partitioning and Clustering[J]. IEEE Trans. Computer-Aided Design, 1994, 13:1088-1096.
[11] Ding C., He X., Zha H., et al. A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering[C]// Proc. of ICDM, 2001: 107-114.
[12] Zelnik-Manor L., Perona P. Self-Tuning Spectral Clustering[C]// NIPS 17. 2005.
[13] Duda R. O., Hart P. E., Stork D. G. Pattern Classification, Second Edition[M].Wiley-Interscience Publishes, 2000.
[14] Lewis D. D., Yang Y., Rose T., et al. RCV1: A New Benchmark Collection for Text Categorization Research[J].Journal of Machine Learning Research, 2004.
[15] 周昭涛. 文本聚类分析效果评价及文本表示研究[D]. 北京: 中国科学院计算技术研究所, 2005.
{{custom_fnGroup.title_cn}}
脚注
{{custom_fn.content}}
基金
国家973基础研究计划项目资助(2007CB311100);国家自然科学基金重点项目资助(60933005)
{{custom_fund}}