Mining Translation Pairs with Learnt Patterns from Wikipedia
DUAN Jianyong1, YAN Qiwei2, ZHANG Mei1, HU Yi2
1. College of Information Engineering, North China University of Technology, Beijing 100144, China; 2. Searching Product Section, Tencent Corporation, Shanghai 200230, China
Abstract: Bilingual translation pairs play an important role in many NLP applications, such as cross-language information retrieval and machine translation. The translation of proper names, out-of-vocabulary words, idioms, and technical terminology is one of the key factors affecting the performance of these systems. However, such translations can hardly be found in traditional bilingual dictionaries. This paper proposes a new method for automatically extracting high-quality translation pairs from Wikipedia, exploiting its broad coverage and its data structure. The method learns not only common patterns but also many patterns that human beings can hardly find. It consists of three steps: 1) extract translation pairs from the language toolbox of Wikipedia, which serve as heuristic seeds for the next step; 2) learn patterns of translation pairs with a PAT-Array, using the knowledge gained from the previous step; 3) automatically extract further translation pairs using the learned patterns. Our experimental results show that the accuracy can reach 90.4%.
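The three steps above can be illustrated with a minimal sketch. It is not the method of the paper: it assumes Chinese-English pairs, replaces the PAT-Array with plain frequency counting of the strings that separate a seed term from its translation, and uses hypothetical function names (`learn_patterns`, `extract_pairs`) and hand-written sample sentences in place of Wikipedia text.

```python
import re
from collections import Counter


def learn_patterns(seed_pairs, sentences, min_count=2, max_gap=10):
    """Step 2 (simplified): count the delimiters that surround seed translations.

    A pattern is a (gap, closer) tuple: the text between the source term and
    its translation, and the single character that follows the translation.
    Assumption: simple counting stands in for the PAT-Array of the paper.
    """
    counts = Counter()
    for src, tgt in seed_pairs:
        for sent in sentences:
            i = sent.find(src)
            if i < 0:
                continue
            j = sent.find(tgt, i + len(src))
            if j < 0:
                continue
            gap = sent[i + len(src):j]
            end = j + len(tgt)
            closer = sent[end:end + 1]
            if 0 < len(gap) <= max_gap and closer:
                counts[(gap, closer)] += 1
    # Keep only patterns seen often enough to be trusted.
    return [p for p, c in counts.items() if c >= min_count]


def extract_pairs(patterns, sentences):
    """Step 3 (simplified): apply learned patterns to find new pairs.

    Assumption: the source term is a run of CJK characters and the
    translation is a run of Latin letters; real term boundaries are harder.
    """
    found = set()
    for gap, closer in patterns:
        regex = re.compile(
            r"([\u4e00-\u9fff]+)" + re.escape(gap)
            + r"([A-Za-z][A-Za-z .\-]*?)" + re.escape(closer)
        )
        for sent in sentences:
            for m in regex.finditer(sent):
                found.add((m.group(1), m.group(2).strip()))
    return found


# Step 1 stand-in: seed pairs as they might come from interlanguage links.
seeds = [("北京", "Beijing"), ("上海", "Shanghai")]
corpus = [
    "北京（Beijing）是中国的首都。",
    "上海（Shanghai）是一座城市。",
    "广州（Guangzhou）位于中国南部。",
]
patterns = learn_patterns(seeds, corpus)
pairs = extract_pairs(patterns, corpus)
```

In this toy corpus the two seeds both occur in a parenthetical context, so that context becomes a pattern and then recovers a pair that was not among the seeds. The source-term matcher is deliberately naive: it takes the whole run of Chinese characters before the delimiter, so a real system would need proper term-boundary detection.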