A LDA Topic Model Based Collection Selection Method for Distributed Information Retrieval

HE Xufeng; CHEN Ling; CHEN Gencai; QIAN Kun; WU Yong; WANG Jingchang

Journal of Chinese Information Processing, 2017, Vol. 31, Issue (3): 125-133.
Information Retrieval and Question Answering Systems

Abstract

In distributed information retrieval, different collections contribute differently to the final search results. To address this, an LDA topic model based collection selection method is proposed. First, the method acquires a description of each collection by query-based sampling. Second, it builds an LDA topic model to compute the topical relevance between the query and a document. Third, it combines keyword relevance with topical relevance to estimate the overall relevance between the query and each document in the sample set, from which the relevance between the query and each collection is estimated. Finally, the M collections with the highest relevance are selected for retrieval. The experiments use R_m, P@n, and MAP as evaluation metrics to verify the performance of the collection selection method. The results show that the proposed method locates collections containing many relevant documents more accurately and improves the recall and precision of the search results.
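The four steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the scoring formulas, so the keyword score (fraction of matched query terms), the topic score (a mixture over topics of P(term | topic) * P(topic | document)), the linear interpolation weight `weight`, and the summation over sampled documents are all assumptions, as are the function names and the toy data. The sampled documents and the LDA distributions are taken as precomputed inputs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (collection id, document tokens, P(topic | document)); hypothetical input shape.
Sample = Tuple[str, Sequence[str], Sequence[float]]


def keyword_relevance(query: Sequence[str], doc: Sequence[str]) -> float:
    """Assumed keyword score: fraction of query terms that occur in the document."""
    if not query:
        return 0.0
    doc_terms = set(doc)
    return sum(1 for term in query if term in doc_terms) / len(query)


def topic_relevance(
    query: Sequence[str],
    doc_topics: Sequence[float],
    topic_word: Sequence[Dict[str, float]],
) -> float:
    """Assumed topic score: mean over query terms of sum_k P(term | k) * P(k | doc)."""
    if not query:
        return 0.0
    total = 0.0
    for term in query:
        total += sum(
            p_topic * topic_word[k].get(term, 0.0)
            for k, p_topic in enumerate(doc_topics)
        )
    return total / len(query)


def select_collections(
    query: Sequence[str],
    samples: Sequence[Sample],
    topic_word: Sequence[Dict[str, float]],
    m: int,
    weight: float = 0.5,
) -> List[str]:
    """Rank collections by the summed combined score of their sampled documents."""
    scores: Dict[str, float] = defaultdict(float)
    for collection_id, tokens, doc_topics in samples:
        combined = weight * keyword_relevance(query, tokens) + (
            1.0 - weight
        ) * topic_relevance(query, doc_topics, topic_word)
        scores[collection_id] += combined
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [collection_id for collection_id, _ in ranked[:m]]


# Toy data: two topics and three collections represented by sampled documents.
topic_word = [
    {"lda": 0.5, "topic": 0.5},
    {"football": 0.6, "goal": 0.4},
]
samples = [
    ("A", ["lda", "topic", "model"], [0.9, 0.1]),
    ("A", ["topic", "inference"], [0.8, 0.2]),
    ("B", ["football", "goal"], [0.1, 0.9]),
    ("C", ["goal", "model"], [0.3, 0.7]),
]
selected = select_collections(["lda", "topic"], samples, topic_word, m=2)
```

Summing over sampled documents favors collections with more samples, so a real system would need to account for collection and sample sizes; the sketch leaves that out because the abstract does not describe it.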

Key words

collection selection / distributed information retrieval / LDA

Cite this article

HE Xufeng; CHEN Ling; CHEN Gencai; QIAN Kun; WU Yong; WANG Jingchang. A LDA Topic Model Based Collection Selection Method for Distributed Information Retrieval. Journal of Chinese Information Processing. 2017, 31(3): 125-133

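Funding

National Science and Technology Major Project on Core Electronic Devices, High-end Generic Chips and Basic Software ("He-Gao-Ji") (2010ZX01042-002-003); National Natural Science Foundation of China (60703040, 61332017); Major Science and Technology Project of Zhejiang Province (2011C13042, 2013C01046); China Knowledge Centre for Engineering Sciences and Technology (CKCEST-2014-1-5)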