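Uneven quality, low credibility, and redundancy in Web data waste a great deal of the storage and computing resources of Web information retrieval tools and directly limit improvements in retrieval performance. Existing Web data cleansing methods are not designed specifically for the needs of Web information retrieval and therefore have considerable shortcomings. Based on an analysis of the query behavior of search users, this paper proposes a Web data cleansing algorithm that uses query-independent feature analysis and prior-knowledge learning to compute the probability that a page will be a retrieval result page. Experimental results on the standard test platform of the Text REtrieval Conference (TREC) show that the algorithm can remove low-quality pages amounting to more than 45% of the corpus while retaining nearly 95% of the retrieval result pages, which means that higher retrieval performance with fewer storage and computing resources becomes possible.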
Abstract
The existence of low-quality Web pages harms both the effectiveness and the efficiency of Web search. In this paper, we formulate Web page quality estimation as a learning problem. First, we investigate several query-independent features that can separate search target pages from ordinary ones. Bayes estimation based on these features is then used to train a model that assigns importance scores to Web pages. In TREC-based experiments, the top-scored set removes low-quality pages amounting to more than 45% of the corpus while retaining nearly 95% of the high-quality ones. This shows that search engines can achieve better performance with less storage and fewer computing resources.
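As an illustration only, the scoring idea summarized above can be sketched as a small naive Bayes model over binary query-independent features. The abstract does not specify the feature set, the independence assumption, the smoothing, or the cut-off rule, so everything named here (`train`, `log_odds`, `keep_top`, the two toy features, `alpha`, `keep_fraction`) is a hypothetical choice rather than the paper's actual procedure.

```python
from math import log

def train(pages, labels, n_features, alpha=1.0):
    """Estimate P(class) and P(feature=1 | class) with Laplace smoothing.

    pages  : list of tuples of 0/1 feature values (hypothetical features)
    labels : list of bools, True if the page is a known search target page
    """
    counts = {True: [0] * n_features, False: [0] * n_features}
    totals = {True: 0, False: 0}
    for feats, is_target in zip(pages, labels):
        totals[is_target] += 1
        for i, f in enumerate(feats):
            counts[is_target][i] += f
    n = len(pages)
    prior = {c: (totals[c] + alpha) / (n + 2 * alpha) for c in (True, False)}
    cond = {
        c: [(counts[c][i] + alpha) / (totals[c] + 2 * alpha)
            for i in range(n_features)]
        for c in (True, False)
    }
    return prior, cond

def log_odds(feats, prior, cond):
    """Log of P(target | feats) / P(ordinary | feats), assuming independent features."""
    score = log(prior[True]) - log(prior[False])
    for i, f in enumerate(feats):
        p_target = cond[True][i] if f else 1 - cond[True][i]
        p_ordinary = cond[False][i] if f else 1 - cond[False][i]
        score += log(p_target) - log(p_ordinary)
    return score

def keep_top(scores, keep_fraction):
    """Return the indices of the highest-scoring pages; the rest are cleansed."""
    k = int(round(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy data: each page has two made-up binary features.
pages = [(1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (1, 1)]
labels = [True, True, False, False, False, True]

prior, cond = train(pages, labels, n_features=2)
scores = [log_odds(p, prior, cond) for p in pages]
kept = keep_top(scores, keep_fraction=0.5)
```

Because the features are query-independent, the scores can be computed once at indexing time, and the cut-off (`keep_fraction` here) controls the trade-off between the share of pages removed and the share of search target pages retained.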
Key words
computer application /
Chinese information processing /
Web information retrieval /
data cleansing /
machine learning
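Funding
Supported by the National Key Basic Research Program of China (973 Program) (2004CB318108); the Natural Science Foundation of China (60223004, 60321002, 60303005, 60503064); and the Key Project of Science and Technology Research of the Ministry of Education (104236)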