Building Chinese-English Statistical Translation Models from Parallel Web Pages

NIE Jian-yun (聂建云), CHEN Jiang (陈江)

Journal of Chinese Information Processing, 2001, Vol. 15, Issue 1: 1-12.
Review


Building English-Chinese Statistical Translation Models from Semi-structured Parallel Texts

  • NIE Jian-yun, CHEN Jiang

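Abstract (Chinese version)

The purpose of building a translation model is to extract translation relations automatically from parallel texts (or translation examples). This paper describes our attempt to build Chinese-English statistical translation models. The parallel texts we use are semi-structured parallel texts obtained automatically from the World Wide Web. During training, we exploit the HTML structure of the texts as far as possible. Experiments show that the trained translation models reach an accuracy of 80%. For applications such as cross-language information retrieval, this accuracy is largely sufficient. This work shows that queries submitted to search engines can be translated with tools that cost less than machine translation.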

Abstract

A statistical translation model tries to capture translation relationships from a set of parallel texts (or translation examples). This paper describes our attempt to train such translation models from a set of semi-structured parallel texts in Chinese and English. These texts are gathered from the Web by an automatic mining tool, PTMiner. Our work takes advantage of the HTML structure of the texts. Chinese text requires some special processing. Our experiments show that the trained model achieves a translation precision of about 80%. This performance is reasonable for less critical tasks such as cross-language information retrieval. This work shows that a query translation tool can be built at a much lower cost than a machine translation system.
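The abstract does not say how the model parameters are estimated. As an illustration only, the sketch below assumes a simplified word-based model in the style of IBM Model 1 (no NULL word), trained with EM on sentence-aligned, already segmented pairs. The function name `train_model1` and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(target word | source word) with EM (IBM Model 1 style sketch)."""
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform distribution over the target vocabulary.
    t = {s: {w: 1.0 / len(tgt_vocab) for w in tgt_vocab} for s in src_vocab}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for src, tgt in pairs:
            for w in tgt:
                # E-step: split one unit of evidence for w among the source
                # words of the same pair, in proportion to the current t.
                norm = sum(t[s][w] for s in src)
                for s in src:
                    frac = t[s][w] / norm
                    count[s][w] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts for each source word.
        for s in src_vocab:
            for w in tgt_vocab:
                t[s][w] = count[s][w] / total[s]
    return t


# Hypothetical toy corpus: (segmented Chinese words, English words).
toy_pairs = [
    (["信息", "检索"], ["information", "retrieval"]),
    (["信息"], ["information"]),
    (["检索", "模型"], ["retrieval", "model"]),
]
model = train_model1(toy_pairs)
# Most probable English translation for each Chinese word.
best = {s: max(probs, key=probs.get) for s, probs in model.items()}
print(best)
```

On this toy corpus the most probable translation of each Chinese word should be its intuitive English counterpart. A real system trained on mined Web pages would also need sentence alignment and Chinese word segmentation before this step.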

Key words

Chinese-English query translation / parallel web pages / sentence alignment / statistical translation model / cross-language information retrieval

Cite this article

NIE Jian-yun, CHEN Jiang. Building English-Chinese Statistical Translation Models from Semi-structured Parallel Texts. Journal of Chinese Information Processing. 2001, 15(1): 1-12
