A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures

LIU Qi, LIU Yang, SUN Maosong

Journal of Chinese Information Processing, 2013, Vol. 27, Issue (3): 91-100.
Review


Abstract

Parallel corpora are a fundamental resource for statistical machine translation, cross-lingual information retrieval, and other information processing technologies. Although the number of parallel pages on the web is huge and still growing, the heterogeneity and complexity of parallel websites make it a major challenge to acquire high-quality parallel pages quickly and automatically and to build parallel corpora from them. This paper presents a parallel web page mining approach that combines URL patterns with HTML structure. First, HTML structure is used to visit parallel pages recursively. Then, URL patterns are used to optimize the order in which the topology of a parallel website is traversed, yielding efficient and accurate parallel page acquisition. Experiments on two parallel websites, those of the United Nations (www.un.org) and the Hong Kong government (www.gov.hk), show that, compared with traditional approaches, the method cuts acquisition time by more than 50%, improves accuracy by 15%, and significantly improves machine translation quality (BLEU increases by 1.6 and 0.7 points, respectively).
Key words: parallel pages mining; parallel corpus; URL pattern; HTML structure
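
The abstract mentions URL patterns as one of the two signals used to locate parallel pages. As a rough illustration of that general idea only, the sketch below pairs URLs whose language marker can be swapped to produce another known URL. It is not the authors' system: the marker list `LANG_PAIRS`, the helper names `candidate_counterpart` and `pair_by_url_pattern`, and the `example.org` URLs are all assumptions made for this example, and real sites use many other naming conventions.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Hypothetical (source, target) language markers; an assumption for this sketch.
LANG_PAIRS = [("en", "zh"), ("english", "chinese")]


def candidate_counterpart(url: str, src: str, tgt: str) -> Optional[str]:
    """Swap the first standalone marker `src` in `url` for `tgt`.

    A marker counts as standalone when it is not surrounded by letters or
    digits, so "/en/" and "_english." match but the "en" in "english" does not.
    Returns None when the marker does not occur.
    """
    pattern = re.compile(rf"(?<![A-Za-z0-9]){re.escape(src)}(?![A-Za-z0-9])")
    swapped, n = pattern.subn(tgt, url, count=1)
    return swapped if n else None


def pair_by_url_pattern(urls: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (source_url, target_url) pairs whose URLs differ only by a marker."""
    known = set(urls)
    pairs = []
    for url in sorted(known):
        for src, tgt in LANG_PAIRS:
            other = candidate_counterpart(url, src, tgt)
            if other is not None and other in known:
                pairs.append((url, other))
                break  # one counterpart per source URL is enough here
    return pairs


if __name__ == "__main__":
    crawled = [
        "http://example.org/en/news/1.html",
        "http://example.org/zh/news/1.html",
        "http://example.org/en/about.html",  # no counterpart was crawled
        "http://example.org/contact_english.html",
        "http://example.org/contact_chinese.html",
    ]
    for source, target in pair_by_url_pattern(crawled):
        print(source, "<->", target)
```

Matching on URLs alone is cheap but brittle, which is consistent with the abstract's point that URL patterns are combined with HTML structure rather than used in isolation.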


Cite this article

LIU Qi, LIU Yang, SUN Maosong. A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures. Journal of Chinese Information Processing. 2013, 27(3): 91-100


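Funding

Supported by the National 863 Program of China (2012AA011102, 2011AA01A207) and the Ministry of Education-Microsoft Key Laboratory of Media and Network Technology (20123000007)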