LIU Qi, LIU Yang, SUN Maosong. A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures[J]. Journal of Chinese Information Processing (中文信息学报), 2013, 27(3): 91-100.
A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures
LIU Qi, LIU Yang, SUN Maosong
Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China
Abstract: Parallel corpora are a fundamental resource for statistical machine translation, cross-lingual information retrieval, and other information processing technologies. Although the amount of parallel data on the web is continually increasing, the heterogeneity and complexity of parallel websites still make it a challenge to collect such parallel texts. This paper presents a new approach to mining parallel web pages that combines URL patterns with HTML structure. First, we use HTML structure to recursively visit parallel pages. Then, URL patterns are used to optimize the order in which the topology of the parallel website is traversed. The result is an efficient and accurate parallel page mining system. Compared with the traditional approach, experiments on two parallel websites (www.un.org and www.gov.hk) show that this approach saves more than 50% of the processing time and improves accuracy by 15%, resulting in a significant increase in the translation quality of the MT system.
Key words: parallel pages mining; parallel corpus; URL pattern; HTML structure
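The URL-pattern side of the approach can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the patterns it uses, so the marker pairs in LANGUAGE_MARKERS, the function names, and the example.org URLs are all assumptions made for illustration. The sketch pairs pages whose URLs differ only in a language marker, and orders a list of links so that those carrying a marker are visited first. The HTML-structure step (following links between language versions) is not covered here.

```python
import heapq
from urllib.parse import urlsplit, urlunsplit

# Hypothetical language-marker pairs; the abstract does not specify them.
LANGUAGE_MARKERS = [("en", "zh"), ("english", "chinese"), ("eng", "chi")]


def counterpart_urls(url):
    """Return candidate counterpart URLs made by swapping one language marker
    that appears as a whole path segment."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    candidates = []
    for i, segment in enumerate(segments):
        for src, tgt in LANGUAGE_MARKERS:
            # Try the swap in both directions (en -> zh and zh -> en).
            for old, new in ((src, tgt), (tgt, src)):
                if segment == old:
                    swapped = segments[:i] + [new] + segments[i + 1:]
                    candidates.append(
                        urlunsplit(parts._replace(path="/".join(swapped)))
                    )
    return candidates


def pair_pages(urls):
    """Pair URLs from a crawl whose marker-swapped counterpart was also crawled."""
    known = set(urls)
    pairs = set()
    for url in urls:
        for candidate in counterpart_urls(url):
            if candidate in known:
                pairs.add(tuple(sorted((url, candidate))))
    return sorted(pairs)


def traversal_order(links):
    """Order links so that those containing a language marker come first;
    ties keep their original order."""
    heap = []
    for index, link in enumerate(links):
        priority = 0 if counterpart_urls(link) else 1
        heapq.heappush(heap, (priority, index, link))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

For example, given http://example.org/en/news/1.html and http://example.org/zh/news/1.html in the same crawl, pair_pages reports them as one candidate pair, while a page such as http://example.org/about.html is left unpaired and is visited later by traversal_order.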