An HTML Parsing Method for Improving the Retrieval Quality of Chinese Search Engines

SONG Rui-hua, MA Shao-ping, CHEN Gang, LI Jing-yang

Journal of Chinese Information Processing, 2003, Vol. 17, Issue 4: 20-27.

A HTML Parser to Improve Chinese Search Engines

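Abstract (translated from the Chinese)

Chinese search engines often return large numbers of irrelevant items, or indirect items that contain no specific information. One cause of this problem is that web pages contain a great deal of text unrelated to their topic. For search engines that rely on keyword retrieval, solving this problem at the retrieval or post-processing stage is not only costly but, in most cases, impossible. In this paper we introduce the concept of web page noise and, targeting the characteristics of Chinese web pages, implement an HTML parsing method that automatically partitions a page into blocks and removes noise, so that potential irrelevant and indirect items are eliminated at the pre-processing stage. Experimental results show that, without consuming any query time, the method eliminates 100% of the indirect items that Chinese search engines hide, as well as about 11% of the irrelevant or indirect items that cannot be filtered or hidden, thereby substantially improving the precision of the retrieval results.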

Abstract

When using a search engine, people often find many irrelevant or peripherally relevant items in the result list. Most of them are produced by words irrelevant to the topic of a web page. It is costly or even impossible to remove such items using traditional keyword methods. In this paper, we define the concept of noise in web pages and propose a novel approach to cleaning the noise information of web pages in the pre-processing stage. A novel model of Chinese web pages and 4 simple rules are built to discard noise from HTML files. Experimental results show that all the indirect items that appear in the results of site grouping are removed correctly, and that about 11% of the irrelevant or indirect items that cannot be excluded by commercial Chinese search engines are removed by our approach.
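
The abstract does not spell out the block model or the four rules, so the following is only a minimal sketch of the general idea (split a page into blocks, then discard noisy blocks before indexing). It assumes a single hypothetical rule: a block whose visible text is mostly link text is treated as noise. The names `BlockSplitter`, `denoise`, `BLOCK_TAGS`, and the `max_link_ratio` threshold are illustrative assumptions, not the paper's method.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption made for this sketch).
BLOCK_TAGS = {"table", "tr", "td", "div", "p", "ul", "li"}


def _visible_len(text):
    """Count characters after removing all whitespace."""
    return len("".join(text.split()))


class BlockSplitter(HTMLParser):
    """Split a page into text blocks and record how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._text = []        # text pieces of the current block
        self._link_chars = 0   # non-whitespace characters inside <a> tags
        self._in_link = 0      # nesting depth of <a>
        self._skip = 0         # nesting depth of <script>/<style>

    def flush(self):
        """Close the current block and start a new one."""
        text = "".join(self._text).strip()
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += _visible_len(data)


def denoise(html, max_link_ratio=0.5):
    """Return the text of blocks whose link-text ratio is at most max_link_ratio."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()
    kept = []
    for text, link_chars in splitter.blocks:
        if link_chars / _visible_len(text) <= max_link_ratio:
            kept.append(" ".join(text.split()))
    return kept


# Sample page: a navigation bar, one content paragraph, and a footer of links.
page = """
<html><body>
<div><a href="/">首页</a> <a href="/news">新闻</a> <a href="/sports">体育</a></div>
<p>搜索引擎的检索质量取决于索引内容是否与网页主题相关。</p>
<div><a href="/a">更多报道</a> | <a href="/b">联系我们</a></div>
</body></html>
"""

if __name__ == "__main__":
    for block in denoise(page):
        print(block)
```

On the sample page, only the content paragraph survives, so a keyword index built from the output would no longer match the page on navigation words such as 新闻 or 体育. Because the filtering happens when the page is parsed, it adds no work at query time, which is the property the abstract emphasizes.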

Key words

computer application / Chinese information processing / HTML parser / noise filtering / block model / search engine

Cite this article

SONG Rui-hua, MA Shao-ping, CHEN Gang, LI Jing-yang. A HTML Parser to Improve Chinese Search Engines. Journal of Chinese Information Processing. 2003, 17(4): 20-27

Funding

National Key Basic Research Program (973) (G1998030509); Natural Science Foundation project (60223004); 863 High-Tech Program (2001AA114082)