Word Alignment Between Chinese and Japanese Based on Weighted Bipartite Graph

WU Hong-lin, LIU Shao-ming, YU Ge

Journal of Chinese Information Processing, 2007, Vol. 21, Issue (5): 101-106.
Review


Abstract

Efficient automatic word alignment is key to building word-aligned corpora. Many current word alignment methods suffer from three shortcomings: the unknown-word problem, the flexible-translation (synonym) problem, and the global optimal matching problem. To address them, this paper proposes a word alignment model based on maximum matching in a weighted bipartite graph. The model represents each bilingual sentence pair as a bipartite graph, computes the similarity between words from word form (morphological similarity), semantic distance, part of speech, and co-occurrence information, and obtains the final alignment by maximum matching on the weighted graph. Experiments on Chinese-Japanese word alignment show that the method partly resolves all three shortcomings: its F-score is 80%, compared with 72% for GIZA++.
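The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity scores are made up for illustration, the `align` helper and its `threshold` parameter are hypothetical, and the matching itself uses a standard Hungarian-method formulation for the assignment problem, which may differ from the paper's exact procedure. The sketch assumes a precomputed similarity matrix in which `sim[i][j]` already combines the word-form, semantic, part-of-speech, and co-occurrence evidence.

```python
from typing import List, Tuple

INF = float("inf")


def max_weight_matching(weights: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (row, col) pairs of a maximum-weight assignment.

    Hungarian method with potentials, run on costs = -weights.
    Requires len(weights) <= len(weights[0]); every row gets one column.
    """
    n, m = len(weights), len(weights[0])
    assert n <= m, "transpose the matrix first if it has more rows than columns"
    u = [0.0] * (n + 1)    # row potentials (1-based)
    v = [0.0] * (m + 1)    # column potentials (1-based)
    p = [0] * (m + 1)      # p[j] = row matched to column j, 0 if free
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = -weights[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:              # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j])


def align(src: List[str], tgt: List[str], sim: List[List[float]],
          threshold: float = 0.3) -> List[Tuple[str, str]]:
    """Align src and tgt tokens; sim[i][j] scores src[i] against tgt[j].

    Pairs scoring below the (hypothetical) threshold are left unaligned.
    """
    if len(src) <= len(tgt):
        pairs = max_weight_matching(sim)
    else:
        transposed = [list(col) for col in zip(*sim)]
        pairs = sorted((i, j) for j, i in max_weight_matching(transposed))
    return [(src[i], tgt[j]) for i, j in pairs if sim[i][j] >= threshold]


# Made-up similarity scores for a toy Chinese-Japanese sentence pair.
chinese = ["我", "喜欢", "猫"]
japanese = ["私", "は", "猫", "が", "好き"]
scores = [
    [0.8, 0.1, 0.0, 0.1, 0.1],
    [0.1, 0.0, 0.2, 0.1, 0.7],
    [0.0, 0.0, 1.0, 0.0, 0.1],
]
print(align(chinese, japanese, scores))
```

The global optimization matters when the locally best link blocks a better overall alignment. For the weights `[[0.9, 0.8], [0.8, 0.1]]`, a greedy pass that takes the 0.9 link first ends with a total of 1.0, whereas `max_weight_matching` returns `[(0, 1), (1, 0)]` with a total of 1.6.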

Key words

computer application / Chinese information processing / word alignment / bipartite graph / matching

Cite this article

WU Hong-lin, LIU Shao-ming, YU Ge. Word Alignment Between Chinese and Japanese Based on Weighted Bipartite Graph. Journal of Chinese Information Processing. 2007, 21(5): 101-106


Funding

This work was supported by the Fuji Xerox Visiting Researcher Program.