A Translation Model Based Method for Query Session Detection

ZHANG Zhenzhong, SUN Le, HAN Xianpei

Journal of Chinese Information Processing, 2015, Vol. 29, Issue (4): 95-102.
Information Extraction and Text Mining

Abstract

Query session detection aims to identify the consecutive queries that a user submits to satisfy the same information need, and it is critical for query log analysis and user behavior characterization. Traditional query session detection methods mostly rely on lexical comparison of query terms, so they often suffer from the vocabulary-mismatch problem (i.e., topically related queries may not share any common words). To resolve this issue, this paper proposes a translation model based method for query session detection. The method models the relationship between words as word translation probabilities, so it can capture the semantic relatedness between queries even when they do not share any common words. We also propose two approaches for generating training data from web query logs to estimate the word translation probabilities: the first is based on the time gap between queries, and the second is based on the clicked URLs of queries. Experimental results show that the method significantly outperforms the baselines.
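
The idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the toy log `LOG`, the 300-second gap threshold, the relative-frequency estimate of the word translation probabilities (the abstract does not say which estimator the paper uses), the `self_weight` interpolation, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical log records: (user_id, timestamp_seconds, query, clicked_url)
LOG = [
    ("u1", 0, "cheap flights", "kayak.example"),
    ("u1", 60, "airline tickets", "kayak.example"),
    ("u1", 5000, "python tutorial", "docs.example"),
    ("u2", 10, "cheap flights", "expedia.example"),
    ("u2", 40, "airfare deals", "expedia.example"),
]


def pairs_by_time_gap(log, max_gap=300):
    """Pair consecutive queries of one user issued within max_gap seconds.

    The 300-second default is an assumed value, not taken from the paper.
    """
    by_user = defaultdict(list)
    for user, ts, query, _url in log:
        by_user[user].append((ts, query))
    pairs = []
    for events in by_user.values():
        events.sort()
        for (t1, q1), (t2, q2) in zip(events, events[1:]):
            if t2 - t1 <= max_gap and q1 != q2:
                pairs.append((q1, q2))
    return pairs


def pairs_by_clicked_url(log):
    """Pair distinct queries (in both directions) that clicked the same URL."""
    by_url = defaultdict(set)
    for _user, _ts, query, url in log:
        if url:
            by_url[url].add(query)
    pairs = []
    for queries in by_url.values():
        ordered = sorted(queries)
        pairs.extend((a, b) for a in ordered for b in ordered if a != b)
    return pairs


def estimate_translation_probs(pairs):
    """Relative-frequency estimate of P(target_word | source_word).

    A simple stand-in estimator; the paper's own procedure may differ.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for src, tgt in pairs:
        for s, t in product(src.split(), tgt.split()):
            counts[s][t] += 1.0
    probs = {}
    for s, row in counts.items():
        total = sum(row.values())
        probs[s] = {t: c / total for t, c in row.items()}
    return probs


def translation_score(src, tgt, probs, self_weight=0.5, floor=1e-6):
    """Score P(tgt | src): for each target word, average the translation
    probability over the source words, then multiply across target words.
    Identical words get extra mass through self_weight (an assumption)."""
    src_words = src.split()
    score = 1.0
    for t in tgt.split():
        total = 0.0
        for s in src_words:
            learned = probs.get(s, {}).get(t, 0.0)
            identity = 1.0 if s == t else 0.0
            total += self_weight * identity + (1.0 - self_weight) * learned
        score *= max(total / len(src_words), floor)
    return score


training_pairs = pairs_by_time_gap(LOG) + pairs_by_clicked_url(LOG)
PROBS = estimate_translation_probs(training_pairs)

related = translation_score("cheap flights", "airline tickets", PROBS)
unrelated = translation_score("python tutorial", "airline tickets", PROBS)
print(related > unrelated)
```

In this toy log, "cheap flights" and "airline tickets" share no words, yet the first scores higher against the second than "python tutorial" does, because the two queries co-occur in the training pairs. A session detector could then compare such a score against a threshold to decide whether two consecutive queries belong to the same session; that thresholding step is likewise an assumption here.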

Key words

query session detection / vocabulary-mismatch problem / query log

Cite this article

ZHANG Zhenzhong, SUN Le, HAN Xianpei. A Translation Model Based Method for Query Session Detection. Journal of Chinese Information Processing. 2015, 29(4): 95-102


Funding

National Natural Science Foundation of China (61433015, 61272324); National High Technology Research and Development Program of China (2015AA015405)