Abstract
Identifying and locating domain-specific bilingual websites is the key to automatically building a domain-specific bilingual corpus from the Web. However, the quality of sentence pairs often varies considerably across such websites. In contrast to existing methods, which filter out low-quality sentence pairs using textual features of the pairs themselves, this paper starts from the origin of the sentence pairs, namely the domain-specific bilingual websites. Based on the hypothesis that websites with high authority in a domain tend to contain high-quality parallel sentence pairs, we propose a HITS-based optimization method for mining bilingual sentence pairs. The method builds a directed graph from the link information among websites, uses the HITS algorithm to measure each website's authority, and then extracts bilingual sentence pairs only from highly authoritative websites to train a domain-specific machine translation system. Taking the education domain as the target, we test the hypothesis that authoritative websites contain high-quality sentence pairs. Experimental results show that a translation system trained on sentence pairs mined with the proposed method improves on the baseline system by 0.44 BLEU points on average. In addition, to address the "topic drift" problem of the HITS algorithm, we propose an improved algorithm, GHITS, which yields a further gain of 0.40 BLEU points.
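The authority-scoring step described in the abstract can be illustrated with a minimal sketch of the standard HITS iteration over a website link graph. This is not the authors' implementation and it does not cover the GHITS variant, whose details are not given here. The site names, the iteration count, and the `top_k` cutoff are hypothetical choices made only for illustration.

```python
from math import sqrt


def hits(links, iterations=50):
    """Standard HITS on a directed graph.

    links: dict mapping a site to the list of sites it links to.
    Returns (authority, hub) score dicts, each L2-normalized.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all sites linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # Hub update: sum the authority scores of all sites linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += auth[dst]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return auth, hub


def select_authoritative(auth, top_k):
    """Return the top_k sites by authority score (hypothetical cutoff)."""
    return sorted(auth, key=auth.get, reverse=True)[:top_k]


# Hypothetical link graph among bilingual websites.
links = {
    "a.example": ["edu.example"],
    "b.example": ["edu.example", "c.example"],
    "c.example": ["edu.example"],
}
auth, hub = hits(links)
selected = select_authoritative(auth, top_k=1)
```

Under the paper's hypothesis, sentence pairs would then be extracted only from the sites in `selected`. How many sites to keep is a tuning decision that this sketch leaves open.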
Key words
statistical machine translation /
domain-specific machine translation /
domain-specific bilingual websites /
authority /
HITS
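Funding
National Natural Science Foundation of China (61373097, 61272259, 61272260, 90920004); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (2009321110006, 20103201110021); Natural Science Foundation of Jiangsu Province (BK2011282); Major Project of the Natural Science Foundation of the Jiangsu Higher Education Institutions (11KJA520003); Natural Science Foundation of Suzhou (SH201212)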