Abstract
This paper addresses the short texts that have appeared in large numbers in chat language and mobile phone short messages in recent years, and proposes a fast and effective clustering algorithm for them. Because these texts are irregular in form and highly similar to one another, we call them abnormal short texts. Building on existing web page de-duplication algorithms [1~3], the algorithm uses a feature-string extraction method tailored to the characteristics of abnormal short texts and incorporates the idea of compression coding to speed up processing. Experiments show that a clustering system based on this algorithm can process more than one million abnormal short texts per hour with relatively high accuracy.
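The abstract does not spell out the feature-string extraction or the coding scheme, so the following is only a minimal sketch of the general idea, under stated assumptions: the feature string is taken to be the text with everything except letters and digits removed, and CRC32 stands in for the compression coding. The names `feature_string`, `compress_code`, and `cluster_short_texts` are hypothetical and do not come from the paper.

```python
import zlib
from collections import defaultdict


def feature_string(text: str) -> str:
    """Assumed extraction rule: keep letters and digits only, lowercased.

    This drops spaces and punctuation that are often inserted to disguise
    a message; the paper's actual extraction rule is not given in the abstract.
    """
    return "".join(ch.lower() for ch in text if ch.isalnum())


def compress_code(feature: str) -> int:
    """Map a feature string to a compact 32-bit code.

    CRC32 is a stand-in chosen for illustration, not the paper's coding.
    Distinct feature strings can collide on the same code.
    """
    return zlib.crc32(feature.encode("utf-8"))


def cluster_short_texts(texts):
    """Group texts whose feature strings share the same code, in one pass."""
    clusters = defaultdict(list)
    for text in texts:
        clusters[compress_code(feature_string(text))].append(text)
    return list(clusters.values())


if __name__ == "__main__":
    messages = [
        "Call me NOW!!!",
        "c a l l  m e  n o w",
        "call-me-now",
        "See you at 8",
    ]
    for group in cluster_short_texts(messages):
        print(group)
```

Comparing compact codes instead of full strings keeps the grouping step to a single dictionary lookup per message, which is the kind of saving that makes very high throughput plausible. A real system would also need to handle near-duplicates whose feature strings differ slightly, which this sketch does not attempt.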
Key words
artificial intelligence /
pattern recognition /
retrieval /
feature string /
clustering
References
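[1] Wu Pingbo, Chen Qunxiu, Ma Liang. Research on a fast de-duplication algorithm for large-scale Chinese web pages based on feature strings[J]. Journal of Chinese Information Processing, 2003, 17(2): 29-36.
[2] Zhang Gang, Liu Ting, Zheng Shifu, Che Wanxiang, Li Sheng. A fast de-duplication algorithm for large-scale web pages[A]. Proceedings of the 20th Anniversary Conference of the Chinese Information Processing Society of China (Supplement)[C]. 2001. 18-25.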
[3] J. W. Kirriemuir, P. Willett. Identification of duplicate and near-duplicate full-text records in database search outputs using hierarchic cluster analysis[J]. Program: Automated Library and Information Systems, 1995, 29(3): 241-256.
[4] Sun Xuegang, Chen Qunxiu, Ma Liang. Research on topic-based Web document clustering[J]. Journal of Chinese Information Processing, 2003, 17(3): 21-26.
[5] G. Karypis, E. H. Han, V. Kumar. Chameleon: A hierarchical clustering algorithm using dynamic modeling[J]. IEEE Computer, 1999, 32(8): 68-75.
[6] Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval[M]. Addison Wesley, 2004.
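[7] Chen Ru, Zhang Yu, Liu Ting. Research on filtering techniques for variations of specific Chinese information[J]. Chinese High Technology Letters, 2005, 15(19): 7-12.
[8] Wang Binhua, Shi Zhigang. A de-duplication algorithm for large-scale web pages based on hashed keywords[J]. High Performance Computing Technology, 2004, (5): 38-41.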
[9] Thomas H. Cormen, Charles E. Leiserson. Introduction to Algorithms[M]. Second Edition. The MIT Press, 2002.
[10] Bjorner Larsen, Chinatsu Aone. Fast and Effective Text Mining Using Linear-time Document Clustering[J]. In: KDD’99, San Diego, California: 16-22.
[11] Y. Zhao, G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets[A]. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management[C]. 2002. 515-524.