A Fast Clustering Algorithm for Abnormal and Short Texts


Journal of Chinese Information Processing, 2007, Vol. 21, Issue (2): 63-68.

Review




HUANG Yong-guang, LIU Ting, CHE Wan-xiang, HU Xiao-guang


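Abstract (translated from the Chinese)

This paper proposes a fast and effective clustering algorithm for the short texts that have appeared in large numbers in chat language and mobile phone text messages in recent years. Because these short texts are irregular in form and highly similar to one another, we call them abnormal short texts. Building on existing web page de-duplication algorithms [1~3], the algorithm uses a feature-string extraction method tailored to the characteristics of abnormal short texts and incorporates the idea of compression coding, which speeds up processing. Experiments show that a clustering system based on this algorithm can process more than one million abnormal short texts per hour with relatively high accuracy.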

Abstract

This paper addresses the short texts that occur in mobile short messages and chat rooms. Because of their irregular style and mutual similarity, we call them abnormal short texts. We propose an efficient clustering algorithm based on duplicate-information removal algorithms. It takes the features of abnormal short texts into account and applies special techniques, such as feature-code extraction and code compression, to the problem. Experiments show that a clustering system based on this algorithm can process millions of abnormal short texts per hour with high accuracy.
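
The abstract names three ingredients (de-duplication by feature strings, feature extraction suited to noisy short texts, and compressed codes) without giving their details. The following is a minimal sketch of how such a pipeline might look, not the authors' implementation. The normalization rule, the evenly spaced sampling windows, the use of CRC32 as the compressed code, and the names `normalize`, `feature_codes` and `cluster` are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List


def normalize(text: str) -> str:
    """Assumed noise model: keep only letters, digits and CJK characters."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def feature_codes(text: str, length: int = 6, count: int = 3) -> List[int]:
    """Sample up to `count` evenly spaced substrings of `length` characters
    and compress each one into a 32-bit code (CRC32 is an assumption)."""
    s = normalize(text)
    if len(s) <= length or count < 2:
        return [zlib.crc32(s.encode("utf-8"))]
    last_start = len(s) - length
    starts = sorted({i * last_start // (count - 1) for i in range(count)})
    return [zlib.crc32(s[b:b + length].encode("utf-8")) for b in starts]


def cluster(texts: List[str], length: int = 6, count: int = 3) -> List[List[int]]:
    """Single pass: a text joins the first cluster that already owns one of
    its codes, otherwise it starts a new cluster. Returns lists of indices."""
    code_to_cluster: Dict[int, int] = {}
    clusters: List[List[int]] = []
    for idx, text in enumerate(texts):
        codes = feature_codes(text, length, count)
        cid = next((code_to_cluster[c] for c in codes if c in code_to_cluster), None)
        if cid is None:
            cid = len(clusters)
            clusters.append([])
        clusters[cid].append(idx)
        for c in codes:
            # Register codes not yet owned by any cluster.
            code_to_cluster.setdefault(c, cid)
    return clusters


if __name__ == "__main__":
    messages = [
        "Big sale today!!! Call 555-0100 now",
        "B i g  s a l e  t o d a y ... call 555-0100 NOW",
        "See you at lunch tomorrow",
    ]
    print(cluster(messages))
```

Because each text costs a fixed number of hash computations and dictionary lookups, the pass is linear in the number of texts, which is the property that makes throughput in the range the abstract reports plausible. The positional windows in this sketch tolerate only noise that normalization removes and edits outside the sampled windows; a more robust feature-string choice would be needed for inserted or deleted characters.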

Cite this article

HUANG Yong-guang, LIU Ting, CHE Wan-xiang, HU Xiao-guang. A Fast Clustering Algorithm for Abnormal and Short Texts. Journal of Chinese Information Processing. 2007, 21(2): 63-68

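References

[1] Wu Pingbo, Chen Qunxiu, Ma Liang. Research on a fast duplicate-removal algorithm for large-scale Chinese web pages based on feature strings [J]. Journal of Chinese Information Processing, 2003, 17(2): 29-36. (in Chinese)
[2] Zhang Gang, Liu Ting, Zheng Shifu, Che Wanxiang, Li Sheng. A fast duplicate-removal algorithm for large-scale web pages [A]. In: Proceedings of the 20th Anniversary Conference of the Chinese Information Processing Society of China (Supplement) [C]. 2001. 18-25. (in Chinese)
[3] J. W. Kirriemuir, P. Willett. Identification of duplicate and near-duplicate full-text records in database search outputs using hierarchic cluster analysis [J]. Program: automated library and information, 1995, 29(3): 241-256.
[4] Sun Xuegang, Chen Qunxiu, Ma Liang. Research on topic-based Web document clustering [J]. Journal of Chinese Information Processing, 2003, 17(3): 21-26. (in Chinese)
[5] G. Karypis, E. H. Han, V. Kumar. Chameleon: A hierarchical clustering algorithm using dynamic modeling [J]. IEEE Computer, 1999, 32(8): 68-75.
[6] Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval [M]. Addison Wesley, 2004.
[7] Chen Ru, Zhang Yu, Liu Ting. Research on filtering techniques for variants of specific Chinese information [J]. High Technology Letters (高技术通讯), 2005, 15(19): 7-12. (in Chinese)
[8] Wang Binhua, Shi Zhigang. A large-scale web page duplicate-removal algorithm based on hashed keywords [J]. High Performance Computing Technology (高性能计算技术), 2004, (5): 38-41. (in Chinese)
[9] Thomas H. Cormen, Charles E. Leiserson. Introduction to Algorithms [M]. Second Edition. The MIT Press, 2002.
[10] B. Larsen, C. Aone. Fast and Effective Text Mining Using Linear-time Document Clustering [J]. In: KDD'99, San Diego, California: 16-22.
[11] Y. Zhao, G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets [A]. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management [C]. 2002. 515-524.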


Received: 2006-03-03
Published: 2007-04-16
Issue Date: 2007-04-16
