WANG Mingwen, HONG Huan, JIANG Aiwen, ZUO Jiali. An Information Retrieval Graph Model Based on Term Importance[J]. Journal of Chinese Information Processing, 2016, 30(4): 134-141.
An Information Retrieval Graph Model Based on Term Importance
WANG Mingwen, HONG Huan, JIANG Aiwen, ZUO Jiali
School of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China
Abstract: Determining the importance of a document's index terms is a central problem in information retrieval modeling. Retrieval models that use a bag-of-words document representation mostly rely on the term independence assumption and compute term importance as a function of TF and IDF, without considering the relationships between terms. In this paper, we use a graph-of-word document representation to capture the dependencies between terms and propose a novel graph-based information retrieval model, TI-IDF. From the graph we obtain the term co-occurrence matrix and the term transition probability matrix, determine term importance (TI) with a Markov chain computation, and use TI in place of the traditional term frequency at indexing time. The model is more robust. We compared it with traditional retrieval models on public international datasets; the experimental results show that the proposed model consistently outperforms BM25 and, in most cases, also outperforms BM25's extensions, TW-IDF, and other models.
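The pipeline outlined in the abstract (graph-of-word, co-occurrence matrix, transition probability matrix, Markov chain importance, TI in place of TF) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sliding-window size, the PageRank-style damping factor, the uniform fallback for isolated terms, the particular IDF formula, and all function names are hypothetical choices made here for the example.

```python
import math


def cooccurrence_matrix(tokens, window=3):
    """Count undirected co-occurrences of distinct terms within a sliding window.

    Assumption: window=3 links each token to the next two tokens.
    """
    vocab = sorted(set(tokens))
    index = {term: i for i, term in enumerate(vocab)}
    n = len(vocab)
    counts = [[0.0] * n for _ in range(n)]
    for i, term in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != term:
                a, b = index[term], index[other]
                counts[a][b] += 1.0
                counts[b][a] += 1.0
    return vocab, counts


def transition_matrix(counts):
    """Row-normalize co-occurrence counts into transition probabilities."""
    n = len(counts)
    rows = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: an isolated term jumps uniformly to any term.
            rows.append([1.0 / n] * n)
        else:
            rows.append([c / total for c in row])
    return rows


def term_importance(P, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate the stationary distribution by power iteration.

    Assumption: a PageRank-style damping term is added so the iteration
    converges even when the raw chain is periodic.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [
            (1.0 - damping) / n
            + damping * sum(pi[i] * P[i][j] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def ti_idf_score(query, doc_tokens, doc_freq, num_docs, window=3):
    """Score a document by summing TI * IDF over matching query terms."""
    vocab, counts = cooccurrence_matrix(doc_tokens, window)
    ti = dict(zip(vocab, term_importance(transition_matrix(counts))))
    score = 0.0
    for term in query:
        if term in ti:
            # Assumption: a simple smoothed IDF; the paper's variant may differ.
            idf = math.log((num_docs + 1) / doc_freq.get(term, 1))
            score += ti[term] * idf
    return score
```

For instance, calling `ti_idf_score(["graph"], tokens, {"graph": 2}, 10)` on a tokenized document returns the TI-weighted IDF contribution of the term "graph"; query terms that do not occur in the document contribute nothing.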