Abstract
In search engine query logs, a session is a continuous sequence of search actions that a particular user performs over a period of time to satisfy a single information need. Correct session segmentation is an important foundation for work such as the analysis of user search behavior, yet sessions have not been studied systematically so far. To address the problems in related work, this paper gives a unified redefinition of the session concept and carries out exploratory and comparative studies. We conclude that (1) statistical language models are not suitable for session segmentation because of data sparseness, and (2) a decision tree method that uses multiple attributes obtains fairly good results: evaluated at the session level, it achieves an F-measure of 78.6%.
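The abstract reports an F-measure computed with the session as the unit of evaluation but does not spell out the matching rule. The sketch below shows one plausible reading, under the assumption that a predicted session counts as correct only when both of its boundaries coincide with a gold-standard session. The function names and the representation of a segmentation as a list of session start indices are hypothetical and are not taken from the paper.

```python
def to_spans(starts, n_queries):
    """Convert session start indices into a set of (start, end) spans.

    Assumption: a segmentation of one user's query stream is given as the
    indices at which a new session begins; index 0 always begins a session.
    """
    starts = sorted(set(starts) | {0})
    ends = starts[1:] + [n_queries]
    return set(zip(starts, ends))


def session_f_measure(gold_starts, pred_starts, n_queries):
    """Return (precision, recall, F) with whole sessions as the unit.

    Assumption: a predicted session is correct only if both of its
    boundaries match a gold session exactly.
    """
    gold = to_spans(gold_starts, n_queries)
    pred = to_spans(pred_starts, n_queries)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Ten queries; the gold segmentation has 3 sessions, the prediction has 4.
p, r, f = session_f_measure([0, 3, 7], [0, 3, 5, 7], n_queries=10)
print(p, r, f)
```

Under this strict matching rule, a single spurious boundary invalidates the whole session it splits, so session-level scores are typically lower than scores computed over individual boundaries.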
Key words
computer application /
Chinese information processing /
web information retrieval /
query logs /
session segmentation
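Funding
National Natural Science Foundation of China (60603094); Beijing Natural Science Foundation (4082030); National High Technology Research and Development Program of China (863 Program) (2006AA010105)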