Bilingual Dictionary Extraction for Special Domain Based on Web Data

ZHANG Yong-chen, SUN Le, LI Fei, LI Wen-bo, Nishino, YU Hao, FANG Gao-lin

Journal of Chinese Information Processing, 2006, Vol. 20, Issue 2: 18-25.

Abstract

A bilingual dictionary is a basic resource for many NLP applications, such as cross-language information retrieval and machine translation. This paper proposes a method for extracting a domain-specific bilingual dictionary from non-parallel corpora. It first states the basic assumption behind the algorithm and reviews related research. It then describes in detail how the word relation matrix method is used to extract a bilingual dictionary from domain-specific non-parallel corpora. Finally, it analyzes, through extensive experiments, how the choice of seed words affects the extraction results. The experiments show that the number and average frequency of the seed word pairs have a positive effect on the extraction results.
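The abstract does not spell out how the word relation matrix is built, so the following is only a minimal sketch of the general idea, under stated assumptions: each word is described by how often it co-occurs, within a sentence, with each seed word, and a source word is matched to the target candidate whose vector is most similar. The function names (`relation_vector`, `cosine`, `best_translation`), the sentence-level co-occurrence window, the cosine measure, and the toy corpora are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def relation_vector(word, sentences, seeds):
    """Count sentence-level co-occurrences of `word` with each seed word.

    Assumption: co-occurrence within one tokenized sentence is the
    "relation"; the paper may define it differently.
    """
    counts = Counter()
    for tokens in sentences:
        if word in tokens:
            for seed in seeds:
                if seed != word and seed in tokens:
                    counts[seed] += 1
    # One dimension per seed word, in seed order.
    return [counts[seed] for seed in seeds]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_translation(src_word, src_sents, tgt_sents, seed_pairs, candidates):
    """Pick the target candidate whose relation vector best matches src_word."""
    src_seeds = [s for s, _ in seed_pairs]
    tgt_seeds = [t for _, t in seed_pairs]
    src_vec = relation_vector(src_word, src_sents, src_seeds)
    scored = [
        (cosine(src_vec, relation_vector(c, tgt_sents, tgt_seeds)), c)
        for c in candidates
    ]
    return max(scored)[1]


# Toy, non-parallel corpora (hypothetical data, already tokenized).
zh_sents = [["病毒", "感染", "计算机"], ["病毒", "攻击", "网络"], ["程序", "运行", "计算机"]]
en_sents = [["virus", "infects", "computer"], ["virus", "attacks", "network"],
            ["program", "runs", "computer"]]
seed_pairs = [("计算机", "computer"), ("网络", "network")]
candidates = ["virus", "program"]

print(best_translation("病毒", zh_sents, en_sents, seed_pairs, candidates))
print(best_translation("程序", zh_sents, en_sents, seed_pairs, candidates))
```

In this sketch the seed pairs fix the dimensions of both vector spaces, which is one way to see why the number and frequency of seed words matter: more seeds give more dimensions to compare on, and frequent seeds yield less sparse co-occurrence counts.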

Key words

computer application / Chinese information processing / bilingual dictionary / word relation matrix / non-parallel corpus / seed word

Cite this article

ZHANG Yong-chen, SUN Le, LI Fei, LI Wen-bo, Nishino, YU Hao, FANG Gao-lin. Bilingual Dictionary Extraction for Special Domain Based on Web Data. Journal of Chinese Information Processing. 2006, 20(2): 18-25


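Funding

Cooperation project with the Fujitsu Research and Development Center; National Natural Science Foundation of China (60203007); National "863" High-Tech Research and Development Program of China (2003AA1Z2110); Beijing Nova Program for Science and Technology (H020820790130)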