FU Yan, XU Zhaobang, XIA Hu, ZHOU Junlin. Reverse Match Based Semi-automatic Entity Template Extraction for E-commerce Websites[J]. Journal of Chinese Information Processing, 2015, 29(2): 157-162.
Reverse Match Based Semi-automatic Entity Template Extraction for E-commerce Websites
FU Yan, XU Zhaobang, XIA Hu, ZHOU Junlin
Web Sciences Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China
Abstract: The subject information of a Web page is generally concentrated in one region of the page, and this characteristic can be exploited to extract it automatically. Because the HTML tags in page source code are often not well formed, it is difficult to construct a structurally accurate DOM tree through forward matching. This article presents a method that applies reverse matching to construct a complete DOM tree. After the insignificant nodes are deleted from the DOM tree, labels are selected manually from the remaining information nodes, and the template is finalized if those labels are unique. The method is general and semi-automatic; experiments on e-commerce webpages are reported in this paper.
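The pipeline in the abstract (tolerant tree construction, pruning, uniqueness check) can be sketched roughly as follows. The abstract does not spell out the reverse-matching procedure, so this sketch assumes one plausible reading: each closing tag is matched backward against the stack of open elements, and any unclosed elements in between are closed implicitly. The names `Node`, `TolerantTreeBuilder`, `build_tree`, `prune`, and `unique_labels`, the "tag plus class" label, and the "no text means insignificant" rule are all illustrative assumptions, not the authors' implementation.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (small illustrative subset).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def label(self):
        # Assumed label format: tag name plus class attribute, if any.
        cls = self.attrs.get("class")
        return f"{self.tag}.{cls}" if cls else self.tag


class TolerantTreeBuilder(HTMLParser):
    """Builds a tree from HTML whose tags may be unclosed or stray."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Scan backward from the innermost open element for a match.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]  # also closes unclosed descendants
                return
        # No matching open element: ignore the stray closing tag.

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def build_tree(html):
    builder = TolerantTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def prune(node):
    """Drop subtrees that carry no text; return True if node is kept."""
    node.children = [child for child in node.children if prune(child)]
    return bool(node.text) or bool(node.children)


def unique_labels(root):
    """Labels that occur on exactly one text-bearing node."""
    counts = {}

    def walk(node):
        if node.text:
            counts[node.label()] = counts.get(node.label(), 0) + 1
        for child in node.children:
            walk(child)

    walk(root)
    return sorted(label for label, n in counts.items() if n == 1)
```

For a fragment with an unclosed `<p>`, such as `<div class="main"><p class="title">Phone<ul><li>a</li><li>b</li></ul><span></span></div>`, calling `build_tree`, then `prune`, then `unique_labels` leaves `p.title` as the only unique label. A person would then confirm or reject it as a template entry, which is the semi-automatic step.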