Abstract
The pressing need to retrieve information from the Web quickly and accurately has given rise to domain-specific search engines, in which the crawler, which gathers information for a particular domain from the Web, is a core component. This paper studies the crawling of domain-specific Chinese web pages. It first uses the KL distance to verify that web page content and link contexts differ in distribution, and on that basis proposes a crawling algorithm guided by a Naive Bayes classifier that takes the anchor text of a hyperlink and its surrounding context as features. An algorithm is also designed to automatically acquire training data consisting of labeled hyperlinks. Taking financial web pages as an example, the proposed algorithms are evaluated in both off-line and on-line tests. The results show that the crawler guided by the Naive Bayes classifier achieves a harvest rate of nearly 90% on domain-specific web pages.
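The two ingredients named in the abstract, a KL distance between word distributions and a Naive Bayes classifier that scores links from their anchor text and context, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `kl_distance`, `LinkNaiveBayes`, and `rank_links` are hypothetical, the classifier is a multinomial Naive Bayes with add-alpha smoothing, the input is assumed to be already word-segmented Chinese text, and the toy training data and URLs are invented.

```python
import heapq
import math
from collections import Counter


def kl_distance(p_tokens, q_tokens, alpha=1.0):
    """KL(P || Q) between add-alpha smoothed unigram distributions."""
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    vocab = set(p_counts) | set(q_counts)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


class LinkNaiveBayes:
    """Multinomial Naive Bayes over link-context tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, samples):
        # samples: iterable of (tokens, is_relevant) pairs
        for tokens, label in samples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def _log_joint(self, tokens, label):
        total_docs = sum(self.doc_counts.values())
        logp = math.log((self.doc_counts[label] + self.alpha)
                        / (total_docs + 2 * self.alpha))
        denom = (sum(self.word_counts[label].values())
                 + self.alpha * len(self.vocab))
        for w in tokens:
            if w in self.vocab:  # words unseen in training are ignored
                logp += math.log((self.word_counts[label][w] + self.alpha) / denom)
        return logp

    def prob_relevant(self, tokens):
        pos = self._log_joint(tokens, True)
        neg = self._log_joint(tokens, False)
        m = max(pos, neg)  # subtract the max for numerical stability
        return math.exp(pos - m) / (math.exp(pos - m) + math.exp(neg - m))


def rank_links(model, links):
    """Return URLs ordered by decreasing predicted relevance."""
    heap = [(-model.prob_relevant(tokens), url) for url, tokens in links]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Toy, invented training data: (anchor text + context tokens, is_financial)
train = [
    (["股票", "行情", "基金"], True),
    (["银行", "利率", "理财"], True),
    (["足球", "比赛", "直播"], False),
    (["电影", "明星", "娱乐"], False),
]
model = LinkNaiveBayes().fit(train)

# Hypothetical outgoing links found on a crawled page
links = [
    ("http://example.com/sports", ["足球", "直播"]),
    ("http://example.com/fund", ["基金", "行情"]),
]
ranked = rank_links(model, links)
print(ranked)
```

In a crawler built along these lines, the ranked list would feed the frontier so that links whose context looks most relevant to the domain are fetched first. How the paper itself smooths, segments, or thresholds is not stated in the abstract.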
Key words: computer application; Chinese information processing; search engine; domain-specific crawler; Naive Bayesian classifier; hyperlink context