ZHOU Jiaying, ZHU Zhenmin, GAO Xiaofang. Research on Content Extraction from Chinese Web Page Based on Statistic and Content-Features[J]. Journal of Chinese Information Processing, 2009, 23(5): 80-86.
Research on Content Extraction from Chinese Web Page Based on Statistic and Content-Features
ZHOU Jiaying (1,2), ZHU Zhenmin (1), GAO Xiaofang (1,3)
1. Institute of Computing Technology, The Chinese Academy of Sciences, Beijing 100190, China; 2. College of Information Technology, Xiangtan University, Xiangtan, Hunan 411105, China; 3. Joint Faculty of Computer Scientific Research, Capital Normal University, Beijing 100037, China
Abstract: This paper presents a new method for extracting the main content from Web pages based on statistics and content features. The method retains the merits of the traditional statistical method and can also extract multi-body documents, which the purely statistical method cannot obtain. Because the bodies of a multi-body document correspond to multiple subtrees with similar characteristics in the DOM tree of the Web page, we first obtain a content path using the statistical method. The content region and the subtree trunk are then modeled from the important features of that path, and these models are applied to retrieve the complete body content. Experimental results show an extraction precision of 94% for single-body documents and 91% for multi-body documents.
Key words: computer application; Chinese information processing; content extraction; single-body documents; multi-body documents
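The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the statistic or the path features, so the sketch assumes plain text length as the statistic and a (tag, class) pair as the only "feature" of a path node. All names (`Node`, `TreeBuilder`, `parse`, `content_path`, `extract_bodies`) are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}  # their text is never page content


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, cls="", parent=None):
        self.tag, self.cls, self.parent = tag, cls, parent
        self.items = []  # child Nodes and raw text strings, in document order

    def child_nodes(self):
        return [i for i in self.items if isinstance(i, Node)]

    def text(self):
        parts = (i.text() if isinstance(i, Node) else i.strip() for i in self.items)
        return " ".join(p for p in parts if p)

    def text_len(self):
        # Count characters in the subtree, ignoring all whitespace.
        return len("".join(self.text().split()))


class TreeBuilder(HTMLParser):
    """Builds the Node tree; tolerant of unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.cur = self.root
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
            return
        if self.skip:
            return
        node = Node(tag, dict(attrs).get("class") or "", self.cur)
        self.cur.items.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
            return
        if self.skip:
            return
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent  # climb to the nearest open element with this tag
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        if not self.skip:
            self.cur.items.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def content_path(root):
    """Statistical step (assumed): follow the text-heaviest child down to a leaf."""
    path, node = [root], root
    while node.child_nodes():
        node = max(node.child_nodes(), key=Node.text_len)
        path.append(node)
    return path


def extract_bodies(root):
    """Feature step (assumed): find sibling subtrees that look like a path node."""
    path = content_path(root)
    for node in path[1:]:  # top-down, so the largest repeating unit wins
        similar = [s for s in node.parent.child_nodes()
                   if (s.tag, s.cls) == (node.tag, node.cls) and s.text_len() > 0]
        if len(similar) > 1:
            return [s.text() for s in similar]
    # Single-body fallback: deepest path node still holding half of the page text.
    total, region = root.text_len(), root
    for node in path:
        if node.text_len() * 2 >= total:
            region = node
    return [region.text()]


SAMPLE = """
<html><head><title>t</title><style>p{color:red}</style></head>
<body>
<div class="nav"><a href="/">Home</a><a href="/about">About</a></div>
<div class="thread">
  <div class="post"><p>First post body with a fairly long run of text.</p></div>
  <div class="post"><p>Second post, shorter.</p></div>
  <div class="ad">Buy now</div>
</div>
<div class="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    for body in extract_bodies(parse(SAMPLE)):
        print(body)
```

On the sample page, the purely statistical path ends inside the first post only; matching siblings by signature is what recovers the second post while leaving out the navigation bar, the advertisement, and the footer. A signature of tag and class alone is deliberately crude and will group unrelated blocks on pages without class attributes.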