Top-k查询是搜索引擎领域广泛应用的技术之一,该算法从海量数据中返回最符合用户需求的前k 个结果,在执行时能避免对大部分无关文档的打分处理。Top-k 查询虽然极大提升了查询性能,但其存在的慢启动问题并未得到有效解决。为此,该文首先提取倒排索引的静态Top-k信息,再动态计算针对具体查询词项的初始阈值,在此基础上,结合MaxScore和WAND算法,提出了快速启动的Top-k查询处理算法。实验结果表明,该方法能够有效解决上述问题,具有良好的性能。
Abstract
Top-k query is a popular technique of search engines, which returns the most relative results for user from massive data. Although Top-k query significantly improves the performance of the system, its slow-start issue has not been effectively resolved. This paper extracts static Top-k information of inverted index and then calculats initial threshold in real time for specific query. On this basis, this paper presents a rapid start algorithm of Top-k query for the current state-of-art methods MaxScore and WAND. Experimental results show that the proposed approach achieves better performance.
关键词
Top-k查询处理 /
阈值计算 /
倒排索引
{{custom_keyword}} /
Key words
Top-k query /
threshold-calculation /
inverted index
{{custom_keyword}} /
{{custom_sec.title}}
{{custom_sec.title}}
{{custom_sec.content}}
参考文献
[1] Turtle H, Flood J. Query evaluation:strategies andoptimizations[J]. Information Processing & Management, 1995, 31(6):831-850.
[2] Broder A Z, Carmel D, Herscovici M, et al. Efficient query evaluation using a two-level retrieval process[C]// Proceedings of the 12th International Conference on Information and Knowledge Management. ACM, 2003:426-434.
[3] Moffat A, Zobel J. Self-indexing inverted files for fast text retrieval[J]. ACM Transactions on Information Systems (TOIS), 1996, 14(4):349-379.
[4] Jonassen S, Bratsberg S E. Efficient compressed inverted index skipping for disjunctive text-queries[M]. Advances in Information Retrieval. Springer Berlin Heidelberg, 2011:530-542.
[5] Yan H, Ding S, Suel T. Inverted index compression and query processing with optimized document ordering[C]//Proceedings of the 18th International Conference on World Wide Web. ACM, 2009:401-410.
[6] Ricardo B Y, Berthier R N. Modern information retrieval:the concepts and technology behind search (2nd Edition)[M]. Addision Wesley, 2011, 84:2.
[7] Strohman T, Turtle H, Croft W B. Optimization strategies for complex queries[C]//Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2005:219-225.
[8] Rossi C, de Moura E S, Carvalho A L, et al. Fast document-at-a-time query processing using two-tier indexes[C]//Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013:183-192.
[9] Ding S, Suel T. Faster Top-k document retrieval using block-max indexes[C]//Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2011:993-1002.
[10] de Carvalho L L S, de Moura E S, Daoud C M, et al. Heuristics to improve the BMW method and its variants[J]. Journal of Information and Data Management, 2016, 6(3):178.
[11] Jonassen S, Bratsberg S E. Efficient compressed inverted index skipping for disjunctive text-queries[M]. Advances in Information Retrieval. Springer Berlin Heidelberg, 2011:530-542.
{{custom_fnGroup.title_cn}}
脚注
{{custom_fn.content}}
基金
湖南省自然科学基金(2016JJ2007)
{{custom_fund}}