A Study of Theme Information Extraction for Arbitrary Webpages

ZHANG Ruqing, GUO Yan, LIU Yue, YU Xiaoming, CHENG Xueqi

Journal of Chinese Information Processing ›› 2017, Vol. 31 ›› Issue (5): 127-137.
Information Extraction and Text Mining


A General Theme Information Extraction for Webpages

  • ZHANG Ruqing1, 2, GUO Yan1, LIU Yue1, YU Xiaoming1, CHENG Xueqi1

Abstract

Most existing web information extraction methods are limited to one specific type of webpage and do not extend to arbitrary webpages. To address this, the paper proposes a fusion-based framework for extracting the theme information of arbitrary webpages. The framework combines a template-independent, fully automatic extraction algorithm with a template-based extraction algorithm through four steps: template library matching, template-based extraction, webpage classification, and fully automatic extraction. Experiments show that this fusion mechanism effectively improves extraction precision, yielding a practical information extraction framework that applies to arbitrary webpages.
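The four steps named in the abstract can be pictured as a small dispatch pipeline. The Python sketch below is only an illustration under stated assumptions: the abstract does not specify how templates are matched, which classifier is used, or how the automatic extractors work, so the fallback control flow (try the template library first, otherwise classify the page and run a type-specific automatic extractor) and every name in it (`Template`, `FusionExtractor`, `build_demo_extractor`, the toy string and regex rules) are hypothetical stand-ins, not the authors' implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


def clean_text(fragment: str) -> str:
    """Drop tags and collapse whitespace (toy helper, not from the paper)."""
    return " ".join(re.sub(r"<[^>]+>", " ", fragment).split())


@dataclass
class Template:
    """Hypothetical template: a matching rule plus an extraction rule."""
    name: str
    matches: Callable[[str], bool]
    extract: Callable[[str], str]


@dataclass
class FusionExtractor:
    templates: List[Template]                         # the template library
    classify: Callable[[str], str]                    # page -> page type
    auto_extractors: Dict[str, Callable[[str], str]]  # page type -> extractor

    def match_template(self, html: str) -> Optional[Template]:
        # Step 1: template library matching.
        for template in self.templates:
            if template.matches(html):
                return template
        return None

    def extract(self, html: str) -> Dict[str, str]:
        template = self.match_template(html)
        if template is not None:
            # Step 2: template-based extraction.
            return {"method": f"template:{template.name}",
                    "content": template.extract(html)}
        # Step 3: webpage classification (assumed fallback path).
        page_type = self.classify(html)
        # Step 4: fully automatic, template-independent extraction.
        return {"method": f"auto:{page_type}",
                "content": self.auto_extractors[page_type](html)}


def build_demo_extractor() -> FusionExtractor:
    """Wire up toy rules; real components would replace each of these."""
    def article_body(html: str) -> str:
        found = re.search(r'<div id="article">(.*?)</div>', html, re.S)
        return clean_text(found.group(1)) if found else ""

    def longest_paragraph(html: str) -> str:
        paragraphs = [clean_text(p)
                      for p in re.findall(r"<p>(.*?)</p>", html, re.S)]
        return max(paragraphs, key=len, default="")

    def all_posts(html: str) -> str:
        posts = re.findall(r'<div class="post">(.*?)</div>', html, re.S)
        return "\n".join(clean_text(p) for p in posts)

    return FusionExtractor(
        templates=[Template("demo-site",
                            lambda html: '<div id="article">' in html,
                            article_body)],
        classify=lambda html: "forum" if 'class="post"' in html else "news",
        auto_extractors={"news": longest_paragraph, "forum": all_posts},
    )


extractor = build_demo_extractor()
page_known = '<html><div id="article"><p>Known layout body.</p></div></html>'
page_news = "<html><p>Menu</p><p>A longer paragraph of main text.</p></html>"
page_forum = ('<div class="post">First reply</div>'
              '<div class="post">Second reply</div>')

for page in (page_known, page_news, page_forum):
    print(extractor.extract(page))
```

Keeping the classifier and the per-type extractors as plain callables means each stage can be swapped independently, which is one way to read the "fusion" of template-based and template-independent extraction described above.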

Key words

arbitrary webpage / theme information / web page classification / practical value

Cite this article

ZHANG Ruqing, GUO Yan, LIU Yue, YU Xiaoming, CHENG Xueqi. A General Theme Information Extraction for Webpages. Journal of Chinese Information Processing. 2017, 31(5): 127-137

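Funding

National Basic Research Program of China ("973" Program) (2014CB340401, 2013CB329606); Key Research and Development Program of the Ministry of Science and Technology (2016QY02D0405); National Natural Science Foundation of China (61232010, 61472401, 61425016, 61203298); Youth Innovation Promotion Association of the Chinese Academy of Sciences, Outstanding Member Project (20144310, 2016102)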