QU You-li, YU Hao, XU Guo-wei, NIsino
2004, 18(1): 7-14.
With the development of the Internet, the number of Web pages is increasing dramatically, and efficient information extraction from Web pages is becoming increasingly important. Some Web pages contain multiple information units arranged in an orderly, compact layout with the same presentation style and similar HTML syntax; a BBS page containing multiple posts is one example. For information extraction, information filtering, and similar Web applications, such a page must first be segmented into appropriate information blocks as a preprocessing step. This paper proposes a new automatic approach to segmenting a Web page into information blocks. First, we construct a structural HTML parse tree for the Web page and then locate the sub-tree that contains all the information blocks. Finally, a 2-rank PAT algorithm is applied to segment that sub-tree according to its depth and the information of the nodes under it. Our experiments on BBS pages show that this approach is fairly effective.
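The pipeline outlined in the abstract (build a parse tree, locate the sub-tree holding the repeated blocks, split it) can be illustrated with a minimal Python sketch. The abstract does not specify the 2-rank PAT algorithm, so it is not reproduced here; as a stand-in, this sketch assumes that sibling nodes sharing the same shallow tag structure are the information blocks. The names Node, TreeBuilder, signature, locate_container, segment, and SAMPLE_PAGE are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without a closing tag; the parser must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree; tolerant of stray or missing end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore unmatched end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def signature(node, depth=2):
    """Shallow structural fingerprint: the tag plus child tags down to `depth`."""
    if depth == 0:
        return node.tag
    return (node.tag, tuple(signature(c, depth - 1) for c in node.children))


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def locate_container(root):
    """Assumed heuristic (not PAT): pick the node with the most same-shaped children."""
    best, best_sig, best_count = None, None, 1
    for node in iter_nodes(root):
        counts = Counter(signature(c) for c in node.children)
        if not counts:
            continue
        sig, count = counts.most_common(1)[0]
        if count > best_count:
            best, best_sig, best_count = node, sig, count
    return best, best_sig


def text_of(node):
    parts = list(node.text)
    for child in node.children:
        parts.extend(text_of(child))
    return parts


def segment(html):
    """Return the text of each repeated block, or [] if no repetition is found."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    container, sig = locate_container(builder.root)
    if container is None:
        return []
    return [" ".join(text_of(c)) for c in container.children if signature(c) == sig]


# Made-up BBS-like page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<div id="posts">
  <div class="post"><b>alice</b><p>First post</p></div>
  <div class="post"><b>bob</b><p>Second post</p></div>
  <div class="post"><b>carol</b><p>Third post</p></div>
</div>
</body></html>
"""

if __name__ == "__main__":
    for block in segment(SAMPLE_PAGE):
        print(block)
```

On the sample page the navigation bar is skipped because none of its siblings shares its structure, while the three posts are returned as separate blocks. Real BBS pages vary more between posts than this exact-match fingerprint allows, which is presumably why the paper relies on a pattern-discovery algorithm rather than strict equality.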