Abstract
Chinese search engines often return many irrelevant items, or indirect items that carry no specific information. One cause is the large amount of text in web pages that is unrelated to the topic of the page. For search engines based on keyword retrieval, removing such items at the retrieval or post-processing stage is costly and, in most cases, impossible. In this paper, we define the concept of noise in web pages and, targeting the characteristics of Chinese web pages, implement an HTML parsing method that automatically segments a page into blocks and discards the noise, so that potential irrelevant and indirect items are eliminated at the pre-processing stage. The method is built on a block model of Chinese web pages and 4 simple rules. Experimental results show that, without consuming any query time, the method removes all of the indirect items that Chinese search engines hide through site grouping, as well as about 11% of the irrelevant or indirect items that commercial Chinese search engines cannot filter out or hide, which substantially improves retrieval precision.
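To make the idea of block-level noise removal at the pre-processing stage concrete, the following is a minimal Python sketch, not the authors' implementation. The four rules of the paper are not stated in this abstract, so the two tests used here (a link-text ratio and a minimum block length), the list of block tags, and the names `BlockSplitter`, `clean`, `max_link_ratio` and `min_chars` are all assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption, not the paper's model).
BLOCK_TAGS = {"body", "div", "table", "tr", "td", "p", "ul", "li"}


def visible_len(text):
    """Count non-whitespace characters."""
    return sum(1 for c in text if not c.isspace())


class BlockSplitter(HTMLParser):
    """Split a page into text blocks and record how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_chars)
        self._parts = []       # text pieces of the block being built
        self._link_chars = 0   # non-whitespace characters inside <a> in this block
        self._in_link = 0
        self._skip = 0         # depth inside <script>/<style>

    def flush(self):
        """Close the current block and start a new one."""
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += visible_len(data)


def clean(html, max_link_ratio=0.5, min_chars=10):
    """Return the text of blocks that pass both hypothetical noise tests."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()
    kept = []
    for text, link_chars in splitter.blocks:
        total = visible_len(text)
        if total < min_chars:
            continue  # too short to carry topical content (assumed rule)
        if link_chars / total > max_link_ratio:
            continue  # mostly link text, e.g. a navigation bar (assumed rule)
        kept.append(text)
    return kept


SAMPLE = (
    '<html><body>'
    '<div><a href="/a">首页</a> | <a href="/b">新闻</a> | <a href="/c">体育</a></div>'
    '<div><p>搜索引擎返回的结果中经常包含与主题无关的网页。</p></div>'
    '<div>版权所有</div>'
    '</body></html>'
)

if __name__ == "__main__":
    for block in clean(SAMPLE):
        print(block)
```

On the sample page, the navigation bar is dropped by the link-ratio test and the short footer by the length test, so only the body sentence is kept for indexing. Because the filtering runs before indexing, it adds no cost at query time, which matches the pre-processing design described in the abstract; the thresholds themselves are arbitrary and would need tuning on real pages.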
Key words
computer application /
Chinese information processing /
HTML parsing /
noise filtering /
block model /
search engine
Funding
National Key Basic Research Program (973 Program) (G1998030509); Natural Science Foundation project (60223004); 863 High-Tech Program (2001AA114082)