A Board-Based Incremental Crawling of Web Forums

DU Yanqi, MA Jun

Journal of Chinese Information Processing, 2010, Vol. 24, Issue 3: 62-69.
Review

Abstract

This paper studies the incremental crawling of Web forums. A topic in a forum is usually spread across multiple pages, whereas the revisit strategies of traditional incremental crawling techniques operate on individual pages, so these techniques are not well suited to crawling forums incrementally. Based on a statistical analysis of how boards change in many Web forums, a board-based incremental crawling strategy is proposed. The strategy treats all pages belonging to the same board as a whole and uses the board as the basic unit of crawling. It uses board weights and local time patterns to determine the crawl frequency and the crawl times. Experimental results show that the average recall for newly published and newly replied threads is 99.3%, and that the total system delay is reduced by up to 42% compared with an even scheduling method.
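
The abstract does not give the formulas behind the board weights or the local time patterns, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes a hypothetical `Board` record with a numeric `weight` and 24 hourly counts of new posts, splits a daily crawl budget across boards in proportion to weight, and places each board's visits at the hours where equal shares of its daily activity have accumulated. All names and the quantile-based timing rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Board:
    name: str
    weight: float               # hypothetical importance score for the board
    hourly_activity: list[int]  # hypothetical count of new posts per hour (24 values)


def allocate_visits(boards: list[Board], daily_budget: int) -> dict[str, int]:
    """Split a daily crawl budget across boards in proportion to their weights.

    Every board gets at least one visit; visits lost to rounding down are
    handed to the boards with the largest fractional remainders.
    """
    total_weight = sum(b.weight for b in boards)
    spare = daily_budget - len(boards)
    if spare < 0 or total_weight <= 0:
        raise ValueError("budget must cover every board and weights must be positive")
    shares = [spare * b.weight / total_weight for b in boards]
    visits = [1 + int(s) for s in shares]
    leftover = daily_budget - sum(visits)
    by_remainder = sorted(
        range(len(boards)), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        visits[i] += 1
    return {b.name: v for b, v in zip(boards, visits)}


def crawl_hours(hourly_activity: list[int], visits: int) -> list[int]:
    """Pick the hours at which 1/visits, 2/visits, ... of daily activity is reached.

    A repeated hour means several visits fall within that hour. With no
    recorded activity the visits are spaced evenly over the day.
    """
    total = sum(hourly_activity)
    if total == 0:
        return [(j * 24) // visits for j in range(visits)]
    hours = []
    cumulative = 0
    next_quantile = 1
    for hour, count in enumerate(hourly_activity):
        cumulative += count
        # Integer comparison of cumulative / total >= next_quantile / visits.
        while next_quantile <= visits and cumulative * visits >= next_quantile * total:
            hours.append(hour)
            next_quantile += 1
    return hours
```

For example, with two boards weighted 3 and 1 and a budget of six visits per day, `allocate_visits` gives the heavier board four visits and the lighter one two. An even scheduler would give three each, which is the kind of baseline the abstract compares against.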

Key words

computer application / Chinese information processing / incremental crawl / forum crawler / delay

Cite this article

DU Yanqi, MA Jun. A Board-Based Incremental Crawling of Web Forums. Journal of Chinese Information Processing. 2010, 24(3): 62-69


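Funding

National Natural Science Foundation of China (60970047); Science and Technology Research Program of Shandong Province (2007GG10001002, 2008GG10001026); Natural Science Foundation of Shandong Province (Y2008G19)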