A New Microblog Ranking Method Combining Clustering and Temporal Information

WEI BingJie (卫冰洁), SHI Liang (史亮), WANG Bin (王斌)

Journal of Chinese Information Processing, 2015, Vol. 29, Issue 3: 177-183.
Information Retrieval and Question Answering Systems


Combining Cluster and Temporal Information for Microblog Search

  • WEI BingJie1,3, SHI Liang3, WANG Bin2

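Abstract (translated from the Chinese)

With the rapid growth of microblogging, microblog retrieval has become one of the active research topics of recent years. Microblog retrieval differs markedly from traditional text retrieval in two respects. First, microblogs have their own characteristics: the text is short, and the content contains topic-summarizing terms (called Hashtags). Second, microblog ranking must take time information into account in addition to textual and semantic similarity. To address these two differences, this paper builds on the statistical language model, uses clustering to expand the text, and applies Hashtag information in the clustering process. Because no more than 13% of the microblogs in the dataset contain a Hashtag, the paper also proposes a method for extending microblog Hashtags, which leads to three cluster-based models. Time information is then incorporated into these three retrieval models by defining a document prior, yielding three models that combine clustering and time information. Finally, experiments on the TREC Microblog data show that the models combining clustering and time information clearly improve MAP and P@30, by 7.1% and 11.6% respectively.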

Abstract

With the rapid development of microblogging, microblog retrieval has become a hot research topic in recent years. Microblog search differs significantly from traditional text retrieval in two respects. First, microblogs have their own text features, namely short text and the Hashtag as a topic term. Second, microblog search should consider time information in addition to text and semantic similarity. This paper addresses these issues by using clustering to expand text content. The Hashtag is introduced into the clustering and, to guarantee its effect, a method to enrich the Hashtags in a microblog is described. Finally, time information is used as the document prior, and altogether three models are examined in the experiments. Experiments on the TREC Microblog dataset show that our models significantly improve MAP and P@30, with increases of 7.1% and 11.6% respectively.
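
The abstract does not give the scoring formulas, so the following is only a minimal sketch of how a cluster-smoothed query-likelihood score could be combined with a time-based document prior. Everything in it is an assumption made for illustration: the function names `lm` and `score`, the two-level linear interpolation with weights `lam` and `beta`, the exponential recency prior with parameter `rate`, and the toy data. It is not the paper's three models, and it leaves out the Hashtag-aware clustering and Hashtag extension steps entirely.

```python
import math
from collections import Counter


def lm(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def score(query, doc, cluster, collection, age_days,
          lam=0.5, beta=0.3, rate=0.01):
    """Log query likelihood plus log document prior (illustrative only).

    The document model is interpolated with the model of the cluster the
    document belongs to, and that in turn with the collection model.
    The prior is assumed to be exponential in the document's age.
    """
    p_doc, p_cluster, p_coll = lm(doc), lm(cluster), lm(collection)
    log_likelihood = 0.0
    for term in query:
        background = (beta * p_cluster.get(term, 0.0)
                      + (1 - beta) * p_coll.get(term, 0.0))
        p = lam * p_doc.get(term, 0.0) + (1 - lam) * background
        if p == 0.0:
            # Term unseen everywhere: the query cannot be generated.
            return float("-inf")
        log_likelihood += math.log(p)
    # Assumed prior P(d) = rate * exp(-rate * age_days).
    log_prior = math.log(rate) - rate * age_days
    return log_likelihood + log_prior


# Toy data (hypothetical): the short post lacks "tsunami",
# but the cluster it belongs to contains the term.
collection = ["tsunami", "warning", "japan", "coast",
              "football", "final", "tonight"]
cluster = ["tsunami", "warning", "japan", "coast"]
doc = ["japan", "coast"]
query = ["tsunami", "japan"]

recent = score(query, doc, cluster, collection, age_days=1.0)
old = score(query, doc, cluster, collection, age_days=30.0)
print(recent, old)
```

In this sketch the cluster model lets a short post receive probability mass for a query term it does not contain, which is the text-expansion role the abstract assigns to clustering, while the prior ranks the more recent of two otherwise identical posts higher.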

Keywords (translated from the Chinese)

microblog retrieval / Hashtag / clustering / time / language model

Key words

microblog search / Hashtag / cluster / temporal / language model

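Cite this article

WEI BingJie, SHI Liang, WANG Bin. A New Microblog Ranking Method Combining Clustering and Temporal Information. Journal of Chinese Information Processing. 2015, 29(3): 177-183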
WEI BingJie, SHI Liang, WANG Bin. Combining Cluster and Temporal Information for Microblog Search. Journal of Chinese Information Processing. 2015, 29(3): 177-183
