该文提出基于高斯加权距离以及聚类重构机制的K-NN文本聚类算法。文章提出K-NN近邻域的概念,通过高斯加权的近邻域算法实施K-NN聚类。利用高斯函数根据样本与聚类中心的距离为样本赋权,计算聚类距离。基于近邻域权重和聚类密度对形成的聚类实施重构,实现聚类数目的自适应调整。使用拆分算子拆分稀疏聚类并调整异常样本;使用合并算子合并相似聚类。实验显示聚类重构机制能够有效地提高聚类的准确率及召回率,增加聚类密度,使得形成的聚类结果更加合理。
Abstract
This paper presents a K-NN text clustering algorithm employing uses Gauss Weighed Distance and Cluster Reorganization Mechanism. The concept of Nearest Domain is proposed and Nearest Domain Rules are elaborated. Then Gauss Weighing Algorithm is designed to Quantification samples’ distance and weights. A text is weighed based on the distance from cluster center via Gauss function in order that distances of clusters can be calculated. Further, Cluster Reorganization Mechanism will make a self-adaption to the amount of clusters. Splitting operator separates sparse clusters and adjusts abnormal texts while consolidating operator combines similar ones. Clustering experiment shows that reorganization process effectively improves the accuracy and recall rate and makes result more reasonable by increasing the inner density of clusters.
关键词
文本聚类 /
K-NN算法 /
高斯加权 /
近邻域规则 /
聚类重构
{{custom_keyword}} /
Key words
text clustering /
K-NN /
Gauss weighing /
nearest domain rule /
cluster reorganization
{{custom_keyword}} /
{{custom_sec.title}}
{{custom_sec.title}}
{{custom_sec.content}}
参考文献
[1] Hyeong-Il Kim and Jae-Woo Chang. K-Nearest Neighbor Query Processing Algorithms for a Query Region in Road Networks[J]. Journal of Computer Science & Technology, 2013, 28(4): 585-596.
[2] 刘金岭,冯万利,张亚红.初始化簇类中心和重构标度函数的文本聚类[J].计算机应用研究,2011,28(11): 4115-4117.
[3] 王灿田,孙玉宝,刘青山.基于稀疏重构的超图谱聚类方法[J].计算机科学,2014,41(2): 145-148,156.
[4] 曾依灵,许洪波,吴高巍,等.一种基于空间映射及尺度变换的聚类框架[J].中文信息学报,2010,24(3): 81-88.
[5] Amineh Amini, Teh Ying Wah, Mahmoud Reza Saybani, et al. A Study of Density-Grid based Clustering Algorithms on Data Streams[C]//Proceedings of the FSKD 2011. Shanghai China. 2011: 1652-1656.
[6] 陈建超,胡桂武,杨志华,等.基于全局性确定聚类中心的文本聚类[J].计算机工程与应用,2011,47(10): 147-150.
[7] 季铎,王智超,蔡东风,等.基于全局性确定聚类中心的文本聚类[J].中文信息学报,2008,22(3): 50-55.
[8] 王骏,王士同,邓赵红. 特征加权距离与软子空间学习相结合的文本聚类新方法[J].计算机学报,2012,35(8): 1655-1665.
[9] M Shahriar Hossain, Praveen Kumar Reddy Ojili, Cindy Grimm, etal. Scatter/Gather Clustering: Flexibly Incorporating User Feedback to Steer Clustering Results[J]. IEEE Transactions on Visualization and Computer Graphics, 2012, 18(12): 2829-2838.
[10] NishaM N, Mohanavalli S, Swathika R. Improving the quality of Clustering using Cluster Ensembles[C]//Proceedings of 2013 IEEE Conference on Information and Communication Technologies. 2013: 88-92.
{{custom_fnGroup.title_cn}}
脚注
{{custom_fn.content}}
基金
国家自然科学基金(61363028)
{{custom_fund}}