FENG Yanhui, HONG Yu, YAN Zhenxiang, YAO Jianmin, ZHU Qiaoming. A Novel Method for Bilingual Web Page Mining Via Search Engines[J]. Journal of Chinese Information Processing, 2011, 25(1): 71-79.
A Novel Method for Bilingual Web Page Mining Via Search Engines
FENG Yanhui, HONG Yu, YAN Zhenxiang, YAO Jianmin, ZHU Qiaoming
School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China
Abstract: We propose a new approach for acquiring bilingual web pages from the result pages of search engines. It comprises two tasks. The first automatically detects the web records embedded in result pages by applying a clustering method to a sample page. The records identified this way yield highly effective features for the second task, high-quality bilingual web page acquisition, which is treated as a classification problem. One advantage of the approach is that it is independent of both the search engine and the domain. Evaluated on 2,516 records that were extracted automatically from six search engines and annotated manually, the approach achieves a precision of 81.3% and a recall of 94.93%, which indicates that it is effective.
Key words: web mining; bilingual web pages; parallel corpora
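The abstract frames bilingual page acquisition as a classification problem over search-result records, scored by precision and recall. The following is a minimal Python sketch of that framing only. The script-share feature, the `low` threshold, and every function name are hypothetical stand-ins chosen for illustration; the abstract does not state the paper's actual features or classifier.

```python
import re

# Hypothetical feature: the abstract does not list the paper's real features.
CJK = re.compile(r"[\u4e00-\u9fff]")   # CJK Unified Ideographs
LATIN = re.compile(r"[A-Za-z]")        # basic Latin letters


def language_mix(text):
    """Return (CJK share, Latin share) among the CJK and Latin characters."""
    cjk = len(CJK.findall(text))
    latin = len(LATIN.findall(text))
    total = cjk + latin
    if total == 0:
        return 0.0, 0.0
    return cjk / total, latin / total


def looks_bilingual(text, low=0.1):
    """Toy rule standing in for a trained classifier (assumption):
    a record counts as bilingual if both scripts reach the `low` share."""
    cjk_share, latin_share = language_mix(text)
    return cjk_share >= low and latin_share >= low


def precision_recall(predicted, gold):
    """Precision and recall of boolean predictions against gold labels."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up records for illustration, not data from the paper.
records = [
    "机器翻译 machine translation 平行语料 parallel corpus",
    "Weather report for Suzhou",
    "苏州大学计算机科学与技术学院",
]
predictions = [looks_bilingual(r) for r in records]
```

In a real system the threshold rule would be replaced by a classifier trained on the manually annotated records, and `precision_recall` would be applied to its predictions on held-out data.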