A Survey of Document Clustering

LIU Yuan-chao, WANG Xiao-long, XU Zhi-ming, GUAN Yi

Journal of Chinese Information Processing, 2006, Vol. 20, Issue (3): 57-64.

Abstract

As an unsupervised machine learning method with a high degree of automation, clustering has in recent years been widely applied in fields such as information retrieval and automatic multi-document summarization. This paper first discusses the application background and architecture of document clustering, and then surveys document clustering algorithms, the construction and dimension reduction of the clustering feature space, and semantic issues in document clustering. Finally, it introduces the evaluation of clustering quality.

Key words

computer application / Chinese information processing / survey / document clustering / dimension reduction / concept relevance / clustering algorithm

Cite this article

LIU Yuan-chao, WANG Xiao-long, XU Zhi-ming, GUAN Yi. A Survey of Document Clustering. Journal of Chinese Information Processing. 2006, 20(3): 57-64

Funding

Key Program of the National Natural Science Foundation of China (60435020)