跨语言文档聚类主要是将跨语言文档按照内容或者话题组织为不同的类簇。该文通过采用跨语言词相似度计算将单语广义向量空间模型(Generalized Vector Space Model, GVSM)拓展到跨语言文档表示中,即跨语言广义空间向量模型(Cross-Lingual Generalized Vector Space Model,CLGVSM),并且比较了不同相似度在文档聚类下的性能。同时提出了适用于GVSM的特征选择算法。实验证明,采用SOCPMI词汇相似度度量算法构造GVSM时,跨语言文档聚类的性能优于LSA。
Abstract
Cross-Lingual Document Clustering is the task to automatically organize a large collection of cross-lingual documents into groups according to their contents or topics. This work extends traditional monolingual Generalized Vector Space Model (GVSM) to Cross-Lingual GVSM (CLGVSM) by using cross-lingual term similarity calculation methods in order to represent documents in different languages and compare different term similarity calculation methods in cross-lingual document clustering. This work also proposes new feature selection method for CLGVSM. Experiment results show that GVSM with Second Order Co-occurrence Point wise Mutual Information (SOCPMI) term similarity measure outperforms the latent semantic analysis (LSA) method.
Key wordsCross-lingual document clustering; CLGVSM; text similarity; document clustering
关键词
跨语言文档聚类 /
跨语言广义向量空间模型 /
文档聚类 /
跨语言信息检索
{{custom_keyword}} /
Key words
Cross-lingual document clustering /
CLGVSM /
text similarity /
document clustering
{{custom_keyword}} /
{{custom_sec.title}}
{{custom_sec.title}}
{{custom_sec.content}}
参考文献
[1] T. Landauer, P. W. Foltz, Darrell Laham. Introduction to Latent Semantic Analysis[J]. Discourse Processes 25: 259-284.
[2] C-P. Wei, C. C. Yang, C-M. Lin. A Latent Semantic Indexing Based Approach to Multilingual Document Clustering [J]. Decision Support System. 45(3):606-620.
[3] T. Leek, H. Jin, S. Sista, et al. The BBN cross-lingual topic detection and tracking system[C]//Proceedings of TDT1999.
[4] H.H. Chen, C.J. Lin. A multilingual news summarizer[C]//Proceedings of COLING2000: 159-165.
[5] D.K. Evans, J.L. Klavans. A Platform for Multilingual News Summarization[R], Technical Report. Department of Computer Science, Columbia University.
[6] B. Mathieu, R. Besancon, C. Fluhr. Multilingual Document Clusters Discovery[C]//Proceedings of RIAO2004: 1-10.
[7] B. Pouliquen, R. Steinberger, C. Ignat, et al. Multilingual and cross-lingual news topic tracking[C]//Proceedings of COLING2004: 959-965.
[8] D. Yogatama, K.Tanaka.. Multilingual Spectral Clustering Using Document Similarity Propagation[C]//Proceedings of EMNLP2009: 871-879.
[9] P. Cimiano, A. Schultz, S. Sizov, et al. Explicit vs. latent concept models for cross-language information retrieval[C]//Proceedings of IJCAI09, 2009.
[10] D. Lin. Automatic retrieval and clustering of similar words[C]//Proceedings of COLING98:768-774.
[11] P. Resnik. Semantic similarity in a taxonomy: An information based measure and its application to problems of ambiguity in natural language[J]. Journal of Artificial Intelligence Research, V.11:95-130.
[12] Q Liu, S Li. Word similarity computing based on How Net[C]//Proceedings of Computational Linguistics and Chinese Language Processing.
[13] Y. Xia, T. Zhao, P. Jin. Measuring Chinese-English Cross-lingual Word Similarity with How Net and Parallel Corpus[C]//Proceedings of CICling2011(II):221-233.
[14] K.W. Church, P. Hanks. Word association norms, mutual information, and lexicography[J]. Computational Linguistics, 16(1):22-29.
[15] P. D. Turney. Mining the Web for Synonyms: PMI-IR versus LSA on TOEF[C]//Proceedings of ECML2001: 491-502.
[16] T. K. Landauer, S. T. Domais. A Solution to Platos Problem: The Latent Semantic Analysis Theory of Acquision, Induction and Representation of Knowledge[J]. Psychological Review. 104(2):211-240.
[17] A. Islam, D. Inkpen. Second order co-occurrence PMI for determining the semantic similarity of words[C]//Proceedings of LREC2006: 1033-1038.
[18] SKM. Wong, W. Ziarko, PCN. Wong. Generalized vector model in information retrieval[C]//Proceedings of the 8th ACM SIGIR:18-25.
[19] A.K. Farahat, M. S. Kamel. Statistical semantic for enhancing document clustering[J]. Knowledge and Information Systems.
[20] E. M. Voorhees. Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval[J]. Information Processing and Management, 22(6): 465-76.
[21] M. Steinbach, G. Kapypis, V. Kumar. A Comparison of Document Clustering Techniques[C]//Proceedings of KDD Workshop on Text Mining, 2000:109-111.
{{custom_fnGroup.title_cn}}
脚注
{{custom_fn.content}}
基金
科技部资助项目(2009DFA12970)
{{custom_fund}}