Organizing web search results into clusters helps users browse them. Many clustering methods have been applied to this task, but most perform poorly because the cluster labels they generate are not readable or informative enough for users to identify the right cluster quickly. In this paper, we focus on generating more readable cluster labels and propose a novel method for doing so. Given the ranked list of snippets returned by a web search engine for a query, we first construct a suffix tree over these snippets. We then score every phrase in the tree using its statistical and syntactic information. Finally, we rank the phrases in descending order of score and select the top k phrases as the cluster labels. With the labels in hand, we form clusters by assigning each snippet to its relevant label. Experimental results show that our method works well for clustering web search results.
LUO Xiong-wu, WAN Xiao-jun, YANG Jian-wu, WU Yu-qian.
Suffix Tree Based Label Generation Method for Web Search Results Clustering. Journal of Chinese Information Processing. 2009, 23(2): 83-88
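The pipeline summarized in the abstract (collect phrases from snippets, score them, keep the top k as labels, assign snippets to labels) can be illustrated with a minimal sketch. It is not the authors' implementation. Two simplifications are assumptions made here: word n-grams of bounded length stand in for the phrases a word-level suffix tree would expose, and the score (snippet frequency times phrase length) is a hypothetical placeholder, since the abstract does not give the actual statistical and syntactic scoring formula. The names `candidate_phrases`, `score`, `label_and_cluster`, `max_len`, and `min_df` are likewise illustrative.

```python
from collections import defaultdict


def candidate_phrases(snippets, max_len=3):
    """Map each word n-gram (1..max_len words) to the snippet indices containing it.

    Assumption: bounded n-grams approximate the phrases that a
    word-level suffix tree over the snippets would contain.
    """
    occurrences = defaultdict(set)
    for idx, snippet in enumerate(snippets):
        words = snippet.lower().split()
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                occurrences[" ".join(words[start:end])].add(idx)
    return occurrences


def score(phrase, snippet_ids):
    """Hypothetical score: number of snippets containing the phrase
    multiplied by the phrase length in words."""
    return len(snippet_ids) * len(phrase.split())


def label_and_cluster(snippets, k=2, min_df=2):
    """Return {label: sorted snippet indices} for the top-k scored phrases."""
    occurrences = candidate_phrases(snippets)
    # Keep phrases shared by at least min_df snippets, rank by descending
    # score, and break ties alphabetically so the result is deterministic.
    ranked = sorted(
        (p for p, ids in occurrences.items() if len(ids) >= min_df),
        key=lambda p: (-score(p, occurrences[p]), p),
    )
    return {label: sorted(occurrences[label]) for label in ranked[:k]}


snippets = [
    "jaguar car dealer",
    "jaguar car price",
    "jaguar cat habitat",
    "big cat habitat",
]
clusters = label_and_cluster(snippets, k=2)
```

With these four toy snippets, the two labels selected are "cat habitat" and "jaguar car", each covering two snippets. A snippet may match several labels or none in this sketch; how the paper resolves such cases is not stated in the abstract.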