LIN Hailun, XIONG Jinhua, WANG Bo, CHENG Xueqi. An Approach to Crawling the Deep Web Based on Domain Knowledge Sampling[J]. Journal of Chinese Information Processing, 2016, 30(2): 175-181.
An Approach to Crawling the Deep Web Based on Domain Knowledge Sampling
LIN Hailun1,2, XIONG Jinhua1, WANG Bo3, CHENG Xueqi1
(1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. CNCERT/CC, Beijing 100029, China)
Abstract: The Deep Web refers to Web database content hidden behind HTML forms, which can be accessed only by submitting those forms. Current web page collection technologies rely solely on hyperlinks and therefore cannot cover these resources effectively. To address this, this paper proposes an approach to crawling the deep web based on domain knowledge sampling. First, it creates a domain attribute set using open source directory services and assigns values to the attributes based on a confidence function. Second, it uses the domain attribute set to select query interfaces and generate assignments. Finally, using a greedy algorithm, it selects the assignment with the highest confidence as the query instance for deep web crawling. Experiments show that the approach can effectively collect deep web resources.
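The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the data layout (attribute name mapped to candidate values with confidence scores), and the choice to combine per-field confidences by multiplication are all assumptions made for illustration, since the abstract does not define the confidence function.

```python
from itertools import product


def generate_assignments(domain_attributes, form_fields):
    """Yield (assignment, confidence) pairs for one query interface.

    domain_attributes: dict mapping attribute name -> {value: confidence}
        (hypothetical layout; the paper's attribute set may differ).
    form_fields: field names found on the HTML form.
    """
    # Only fields that match a known domain attribute can be filled.
    matched = [f for f in form_fields if f in domain_attributes]
    if not matched:
        return
    value_lists = [list(domain_attributes[f].items()) for f in matched]
    for combo in product(*value_lists):
        assignment = {f: value for f, (value, _) in zip(matched, combo)}
        confidence = 1.0
        for _, score in combo:
            # Assumption: overall confidence is the product of field scores.
            confidence *= score
        yield assignment, confidence


def greedy_queries(domain_attributes, form_fields, budget):
    """Return up to `budget` assignments, highest confidence first."""
    ranked = sorted(
        generate_assignments(domain_attributes, form_fields),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [assignment for assignment, _ in ranked[:budget]]
```

Each returned assignment would then be submitted through the form as one query instance. A real crawler would likely also re-score the remaining assignments after each submission, which this sketch omits.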