Abstract: This paper presents a K-NN text clustering algorithm that uses Gauss Weighted Distance and a Cluster Reorganization Mechanism. The concept of the Nearest Domain is proposed and the Nearest Domain Rules are elaborated. A Gauss Weighting Algorithm is then designed to quantify the distances and weights of samples: each text is weighted by a Gauss function of its distance from the cluster center, so that the distances between clusters can be calculated. Further, the Cluster Reorganization Mechanism adapts the number of clusters automatically. A splitting operator separates sparse clusters and adjusts abnormal texts, while a consolidating operator merges similar clusters. Clustering experiments show that the reorganization process effectively improves accuracy and recall and makes the results more reasonable by increasing the inner density of clusters.