Abstract
Timely acquisition of newly published content is an important measure of a web crawler. A crawler built on a board page-content page architecture can quickly discover newly created content pages, and expand its crawl to them, by periodically re-crawling the board pages that serve as entry points. However, there is a trade-off between the freshness of the collected content and friendliness towards the visited website: traditional re-crawling strategies focus on timeliness and neglect the load they place on the site. To address this issue, we propose an improved re-crawling strategy based on time series prediction that balances timeliness and friendliness. Experiments show that the method effectively reduces the number of visits, and thus improves friendliness towards websites, while still collecting data in time.
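The idea of letting a forecast drive the revisit interval can be illustrated with a minimal sketch. It is not the method of the paper, whose predictor is not described in this section; simple exponential smoothing stands in for the time series model, and the class name `BoardSchedule` and all parameter values (`alpha`, `target_new`, `min_interval`, `max_interval`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BoardSchedule:
    """Adaptive revisit interval for one board (entry) page.

    Hypothetical helper: exponential smoothing is used here as a
    stand-in for the time series predictor.
    """
    alpha: float = 0.5            # smoothing factor (assumed)
    target_new: float = 5.0       # desired new items per visit (assumed)
    min_interval: float = 60.0    # seconds; bounds load on the site (assumed)
    max_interval: float = 3600.0  # seconds; bounds staleness (assumed)
    rate: float = 0.0             # forecast of new items per second

    def observe(self, new_items: int, elapsed: float) -> None:
        """Update the forecast from the latest visit to the board page."""
        observed = new_items / elapsed
        self.rate = self.alpha * observed + (1 - self.alpha) * self.rate

    def next_interval(self) -> float:
        """Wait long enough to expect `target_new` items, within bounds."""
        if self.rate <= 0:
            return self.max_interval
        interval = self.target_new / self.rate
        return max(self.min_interval, min(self.max_interval, interval))
```

A quiet board drifts towards `max_interval`, which saves visits, while a busy board is revisited more often but never faster than `min_interval`, which caps the load placed on the site.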
Key words
Web crawling /
crawling strategy /
time series prediction
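Funding
National Key Research Program of China (2017YFC0820404); National Science Fund for Distinguished Young Scholars (61425016); National Natural Science Foundation of China (91746301)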