Abstract
Few existing document clustering methods can properly detect or describe document topics, which makes topic-based clustering difficult. In this paper, we introduce a novel topical document clustering method, Linguistic Features Indexing Clustering (LFIC), which identifies document topics accurately and clusters documents according to those topics. LFIC defines and extracts “topic elements” and uses them to index base clusters; it also integrates linguistic features. Experimental results show that LFIC achieves a clustering precision of 94.66%, higher than that of several widely used traditional clustering methods.
Key words
artificial intelligence /
pattern recognition /
topical document clustering /
base cluster indexing /
linguistic features
Funding
Supported by the National Natural Science Foundation of China (Grant Nos. 60575042, 60503072, 60675034) and the Tencent Fund.