Abstract
This paper presents a method for extracting the main content of Web pages that combines statistics with content features. The method retains the advantages of statistical approaches while using content features to overcome their inability to extract multi-body documents. The approach rests on the observation that, in a page's DOM tree, multiple content bodies correspond to several subtrees with similar features under the content region. We first obtain a content path with a statistical method, then learn the content features of that path to identify the content region and the trunk of a content subtree, and finally use the features of the region and the trunk to recover the complete content. Experiments show an extraction precision of 94% for single-body documents and 91% for multi-body documents.
Key words: computer application; Chinese information processing; content extraction; single-body documents; multi-body documents
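The abstract gives only an outline of the pipeline (statistical content path, then feature-based detection of the content region and of similar subtrees). The following is a minimal Python sketch of that outline, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the statistic is simply the length of a node's direct text, the "content feature" is the pair (tag name, class attribute), and the names `Node`, `TreeBuilder`, `extract_bodies` and the `ratio` threshold are hypothetical.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they are skipped to keep the tree simple.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM node: tag, class attribute, direct text, children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.cls = dict(attrs or []).get("class") or ""
        self.parent = parent
        self.children = []
        self.text = ""

    def signature(self):
        # Assumed "content feature": tag name plus class attribute.
        return (self.tag, self.cls)

    def all_text(self):
        # Direct text first, then the text of the children.
        parts = [self.text] + [c.all_text() for c in self.children]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if self.cur.tag == tag and self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        data = data.strip()
        if data and self.cur.tag not in ("script", "style"):
            self.cur.text += (" " if self.cur.text else "") + data


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def extract_bodies(html, ratio=0.3):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Step 1 (statistics, assumed): the node with the most direct text
    # anchors the content path from the root down to that node.
    best = max(walk(builder.root), key=lambda n: len(n.text))
    # Step 2 (features, assumed): climb the content path and keep the
    # highest level at which siblings share the signature and carry a
    # comparable amount of text; their parent plays the content region.
    bodies = [best]
    node = best
    while node.parent is not None:
        size = len(node.all_text())
        peers = [c for c in node.parent.children
                 if c.signature() == node.signature()
                 and len(c.all_text()) >= ratio * size]
        if len(peers) > 1:
            bodies = peers
        node = node.parent
    # Step 3: the complete content is the text of every matching subtree.
    return [b.all_text() for b in bodies]


SAMPLE = """
<html><body>
<div class="nav"><a>Home</a><a>About</a></div>
<div id="main">
  <div class="post"><h3>alice</h3><p>First long post body text about content extraction from web pages.</p></div>
  <div class="post"><h3>bob</h3><p>Second reply, also fairly long, discussing DOM subtrees.</p></div>
</div>
<div class="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    for body in extract_bodies(SAMPLE):
        print(body)
```

On the hypothetical forum-like `SAMPLE` page, the sketch returns one string per `div class="post"` subtree and leaves out the navigation and footer blocks. The `ratio` filter is a crude stand-in for learned features and will drop short bodies, so a faithful implementation would need the feature model described in the full paper.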
Funding
Supported by the National High-Tech R&D Program of China (863 Program) under the 11th Five-Year Plan (2006AA01Z112)