Document Clustering Based on Word Sense Cluster

Journal of Chinese Information Processing, 2013, Vol. 27, Issue 3: 113-120.

Review

TANG Guoyu¹, XIA Yunqing¹, ZHANG Min², ZHENG Fang¹

Abstract

Document representation is a key component of document clustering, and this paper aims to improve clustering by improving the representation. Synonymy and polysemy are two major challenges for document representation. Observing that both phenomena are mainly tied to word sense, we propose the Sense Cluster Model (SCM), which addresses both issues by representing documents in a space of word sense clusters. SCM first constructs the sense cluster space from the development dataset in two steps: 1) word sense induction, which automatically discovers the senses of each word from raw text; and 2) word sense clustering, which identifies identical or similar senses and groups them into sense clusters. After word sense disambiguation, each document is represented by a probability distribution over the sense clusters. Experiments on benchmark data show that SCM outperforms both the baseline systems and the classic topic model LDA on the document clustering task.

Key words: word sense; document representation; topic model
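
The final step described in the abstract, representing a document as a probability distribution over sense clusters, can be illustrated with a minimal sketch. The abstract does not give the estimation formula, so the relative-frequency estimate below is an assumption, as are the function name `represent_document`, the sense identifiers such as `bank#1`, and the uniform fallback for documents with no recognized senses. Sense induction, sense clustering, and disambiguation are assumed to have already been run.

```python
from collections import Counter


def represent_document(sense_tags, sense_to_cluster, num_clusters):
    """Return a probability vector over sense clusters for one document.

    sense_tags: sense ids, one per disambiguated word occurrence (hypothetical format).
    sense_to_cluster: dict mapping a sense id to its cluster index.
    num_clusters: total number of sense clusters.
    """
    # Count how many word occurrences fall into each sense cluster,
    # skipping senses that were never assigned to a cluster.
    counts = Counter(
        sense_to_cluster[s] for s in sense_tags if s in sense_to_cluster
    )
    total = sum(counts.values())
    if total == 0:
        # Assumed fallback: uniform distribution for an empty document.
        return [1.0 / num_clusters] * num_clusters
    # Relative frequency per cluster (an assumed estimator, not the paper's).
    return [counts[c] / total for c in range(num_clusters)]


# Toy data: two senses of "bank" land in different clusters.
sense_to_cluster = {"bank#1": 0, "loan#1": 0, "bank#2": 1, "shore#1": 1}
doc = ["bank#1", "loan#1", "bank#2", "loan#1"]
vec = represent_document(doc, sense_to_cluster, 2)
print(vec)
```

Because two documents that use different words with the same sense map to the same cluster, and one word with different senses maps to different clusters, vectors of this kind can be passed to any standard clustering algorithm in place of term vectors.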

Cite this article

TANG Guoyu, XIA Yunqing, ZHANG Min, ZHENG Fang. Document Clustering Based on Word Sense Cluster. Journal of Chinese Information Processing. 2013, 27(3): 113-120


Funding

National Natural Science Foundation of China (Grant No. 61272233)


Received: 2012-12-14
Published: 2013-06-15
Issue Date: 2013-06-15
