Research on Text Clustering Based on Semantics and Graph

JIANG Dan, ZHOU Wenle, ZHU Ming

Journal of Chinese Information Processing, 2016, Vol. 30, Issue 5: 121-128.
Review


Abstract

Traditional text clustering usually builds document vectors with the bag-of-words (BOW) model, which ignores the rich semantic information between words. Centroid-based partitioning algorithms, in turn, tend to force apart natural clusters of closely related concepts and are therefore poorly suited to discovering the topics people are interested in. To address these shortcomings, this paper proposes a short-text clustering algorithm based on semantics and complete subgraphs (cliques). Three mainstream semantic models are compared experimentally, and a relatively advanced one is selected as the basis for the clustering experiments. The results show that the new algorithm captures the semantic information of sentences well and achieves higher clustering purity than traditional K-means.
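The abstract gives no implementation details, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes each text has already been mapped to a semantic vector, links two texts when their cosine similarity reaches a hypothetical `threshold`, and treats each maximal complete subgraph as a candidate cluster, enumerated here with the basic Bron-Kerbosch recursion. The function names, the toy vectors, and the threshold value are all illustrative assumptions. The `purity` helper uses the standard definition of clustering purity: the fraction of texts that carry the majority label of their cluster.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_graph(vectors, threshold):
    """Link two texts when their cosine similarity reaches the threshold."""
    adj = {i: set() for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def purity(clusters, labels):
    """Fraction of texts that carry the majority label of their cluster."""
    total = sum(len(c) for c in clusters)
    correct = 0
    for cluster in clusters:
        counts = {}
        for i in cluster:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        correct += max(counts.values())
    return correct / total


# Toy 2-D "semantic vectors" for four short texts (illustrative values only).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]

graph = similarity_graph(vectors, threshold=0.8)
clusters = sorted(maximal_cliques(graph))
print(clusters, purity(clusters, labels))
```

Maximal cliques can overlap in general, so a full method would need a rule for merging or assigning shared texts. How the paper handles that is not stated in the abstract, and the sketch leaves it open.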
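Keywords: text clustering method; complete sub-graph; semantic similarity; distributed representations of words in a vector space
Received: 2015-04-07
Final version: 2015-06-02
Funding: Sea-Cloud Collaborative Real-Time Processing System for Massive Network Data Streams (sub-project) (XDA06011203); Development of an Operation Support System for the New Business Model of TV Commerce Complexes (2012BAH73F01)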


Cite this article

JIANG Dan, ZHOU Wenle, ZHU Ming. Research on Text Clustering Based on Semantics and Graph. Journal of Chinese Information Processing. 2016, 30(5): 121-128


