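Bilingual parallel corpora have many important applications in natural language processing, but acquiring large-scale bilingual parallel corpora automatically is not easy. This paper proposes an effective scheme for obtaining high-quality bilingual parallel corpora from the Web and studies key techniques, including the acquisition of candidate mixed-language web pages and the extraction of parallel sentence pairs. Using this method, a total of 2.58 million bilingual parallel sentence pairs were obtained with an average accuracy of 93.75%, and the average accuracy of the first 1.5 million sentence pairs reaches 96%. The paper also proposes two methods, sentence-pair quality ranking and domain information retrieval, for applying the Web data to model training in statistical machine translation; on the IWSLT evaluation data, the BLEU score improves by 2 to 5 percentage points.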
Abstract
Bilingual parallel corpora are used in many NLP applications, but acquiring large-scale corpora automatically is not easy. This paper proposes an effective solution for mining high-quality bilingual parallel corpora from web pages and examines the key techniques of obtaining candidate mixed-language web pages and aligning sentences. We extracted 1.67 million parallel sentences with an average accuracy of 93.75%; the accuracy of the first 1 million sentences is 96%. This paper also proposes a sentence re-ranking method and a domain information retrieval method for applying the web data to the training of SMT models. Experiments on the IWSLT tasks show gains of 2 to 5 BLEU points over the baseline.
Key words: Web mining; parallel corpora; sentence alignment; statistical machine translation
Funding
Supported by the National Natural Science Foundation of China (60603095)