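Document representation is an essential component of text clustering, and this paper aims to improve text clustering by improving document representation. Synonymy and polysemy are major challenges for document representation. To address them, this paper proposes the Sense Cluster Model (SCM), which represents documents in a space of word sense clusters. SCM first constructs the sense cluster space and then represents documents in that space, obtaining the probability of each document over each sense cluster. To construct the space, word sense induction is first used to discover word senses automatically from text, and word sense clustering is then applied to identify identical or similar senses, which yields the sense clusters. Once the space has been constructed, word sense disambiguation is performed, and its results are used to represent each document in the sense cluster space. Experiments show that SCM outperforms both the baseline system and the classic topic model LDA on a standard test set.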
Abstract
Document representation is a key part of document clustering, and this paper aims to improve document clustering by improving document representation. Synonymy and polysemy are two challenging issues in document representation. Inspired by the observation that both are mainly related to word sense, we present a novel model, referred to as the Sense Cluster Model (SCM), which addresses both issues by representing documents with word sense clusters. In SCM, word sense clusters are first constructed from the development dataset by 1) word sense induction, which automatically discovers the different senses of each word from raw text; and 2) word sense clustering, which recognizes identical or similar senses. Then, after word sense disambiguation, a probability distribution over word sense clusters is generated to represent each document. Experiments conducted on benchmark data show that SCM outperforms both the baseline and the classic topic model LDA in the task of document clustering.
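The final step described in the abstract, turning disambiguated word senses into a probability distribution over sense clusters, can be illustrated with a minimal sketch. The abstract does not say how the probabilities are estimated, so the relative-frequency estimate below is an assumption, and the sense labels, cluster contents, and function names are hypothetical toy values rather than the authors' implementation. Word sense induction, sense clustering, and disambiguation are assumed to have been done already.

```python
from collections import Counter


def build_sense_to_cluster(sense_clusters):
    """Map each induced sense label to the index of the cluster that contains it."""
    return {
        sense: index
        for index, cluster in enumerate(sense_clusters)
        for sense in cluster
    }


def represent_document(disambiguated_senses, sense_to_cluster, num_clusters):
    """Return a probability distribution over sense clusters for one document.

    Assumption: probabilities are plain relative frequencies of the clusters
    hit by the document's disambiguated senses. The paper may use a different
    estimator.
    """
    counts = Counter(
        sense_to_cluster[sense]
        for sense in disambiguated_senses
        if sense in sense_to_cluster  # skip senses outside the cluster space
    )
    total = sum(counts.values())
    if total == 0:
        # No known sense in the document: return an all-zero vector.
        return [0.0] * num_clusters
    return [counts.get(k, 0) / total for k in range(num_clusters)]


# Hypothetical toy sense clusters (output of sense induction + sense clustering).
sense_clusters = [
    {"bank#1", "lender#1"},  # financial-institution senses
    {"bank#2", "shore#1"},   # riverside senses
]
sense_to_cluster = build_sense_to_cluster(sense_clusters)

# Hypothetical document after word sense disambiguation.
doc_senses = ["bank#1", "lender#1", "bank#1", "shore#1"]
vector = represent_document(doc_senses, sense_to_cluster, len(sense_clusters))
print(vector)  # [0.75, 0.25]
```

In this toy case the synonyms "bank#1" and "lender#1" fall into the same cluster, while the two senses of "bank" fall into different clusters, which is how a sense-cluster representation is meant to handle synonymy and polysemy.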
Key words: word sense; document representation; topic model
Keywords
document clustering /
document representation /
topic model
Funding
Supported by the National Natural Science Foundation of China (61272233)