Abstract
Link-based search engine spam techniques such as link farms, link exchanges, golden links, and wealth links have seriously undermined the traditional PageRank algorithm, so that a page's PageRank value no longer reflects its importance fairly or authoritatively. Based on an analysis of how several of these spam techniques harm traditional PageRank, this paper proposes a three-stage PageRank algorithm, TSPageRank (Three Stages PageRank), that can resist link spam to a certain extent. The paper analyzes the principle of TSPageRank in detail, and experiments show that TSPageRank improves on traditional PageRank by 59.4% in effectiveness: it raises the PageRank (PR) values of important, authoritative pages and lowers those of spam pages.
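To make the problem concrete, the following is a minimal sketch of how a link farm inflates a page's score under plain power-iteration PageRank. It illustrates the traditional algorithm only, not TSPageRank, whose three stages are not described here. The toy graph, the page names (`a`, `b`, `c`, `t`, `f0`…`f4`), the damping factor of 0.85, and the helper name `pagerank` are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page splits its current rank equally among its out-links.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy web: three ordinary pages plus a page "t" that nobody links to.
honest = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "t": ["a"]}

# The same web after "t" builds a link farm: five farm pages link to "t",
# and "t" links back to each of them.
farm_pages = [f"f{i}" for i in range(5)]
spammed = dict(honest)
spammed["t"] = ["a"] + farm_pages
for f in farm_pages:
    spammed[f] = ["t"]

before = pagerank(honest)
after = pagerank(spammed)
print(f"PR(t) without farm: {before['t']:.4f}")
print(f"PR(t) with farm:    {after['t']:.4f}")
```

In this sketch the target page gains rank purely from pages created to link to it, which is the kind of distortion a spam-resistant variant has to counteract.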
Key words: search engine spam; PageRank algorithm; link farm
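Funding
National Natural Science Foundation of China (61170230, 60903139, 60873243, 60933005); Key Projects of the National High Technology Research and Development Program of China (863 Program) (2010AA012502, 2010AA012503)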