Sentence Alignment Under A Bipartite Graph Vertex Pairing Model

YAN Canxun

Journal of Chinese Information Processing ›› 2016, Vol. 30 ›› Issue (5) : 153-159.
Review


Abstract

Sentence alignment of English-Chinese parallel text can be modeled as vertex pairing in a bipartite graph. The vertex pairs in the graph are weighted by a bilingual sentence relatedness function that relies entirely on an English-Chinese dictionary and scores the word correspondences between an English sentence and a Chinese sentence. In the proposed approach, the globally maximum-weight vertex pairing is first taken as a set of temporary anchors. Then, using sentence order, the locally maximum-weight vertex pairings, and the admissible range of the English-Chinese sentence length ratio, the errors among the temporary anchors are corrected and the legitimate vertex pairs not covered by the anchor sequence are added. At the same time, the sentences are grouped into minimal groups of corresponding sentences, which completes the alignment. In comparison experiments, the vertex-pairing approach outperforms the Champollion sentence alignment system. Both the experimental comparison and practical use indicate that the approach is feasible.
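The first two steps described in the abstract (weighting vertex pairs with a dictionary, then selecting temporary anchors and removing order violations) can be sketched roughly as follows. This is only an illustrative sketch and not the paper's implementation: the function names `pair_weight`, `temporary_anchors`, and `drop_crossing`, the coverage-ratio weight, the `min_weight` threshold, and the greedy selection are all assumptions made here. The paper's actual evaluation function, its length-ratio check, and its local re-pairing step are not reproduced.

```python
from typing import Dict, List, Tuple


def pair_weight(en_tokens: List[str], zh_sentence: str,
                lexicon: Dict[str, List[str]]) -> float:
    """Hypothetical weight: share of English tokens that have at least one
    dictionary translation occurring in the Chinese sentence."""
    if not en_tokens:
        return 0.0
    hits = sum(
        1 for tok in en_tokens
        if any(t in zh_sentence for t in lexicon.get(tok.lower(), ()))
    )
    return hits / len(en_tokens)


def temporary_anchors(en_sents: List[List[str]], zh_sents: List[str],
                      lexicon: Dict[str, List[str]],
                      min_weight: float = 0.3) -> List[Tuple[int, int]]:
    """Greedy one-to-one pairing by descending weight (an assumed stand-in
    for the globally maximum-weight pairing named in the abstract)."""
    scored = [
        (pair_weight(e, z, lexicon), i, j)
        for i, e in enumerate(en_sents)
        for j, z in enumerate(zh_sents)
    ]
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    used_en, used_zh = set(), set()
    anchors = []
    for w, i, j in scored:
        if w < min_weight:  # assumed cut-off; remaining pairs are weaker
            break
        if i in used_en or j in used_zh:
            continue
        used_en.add(i)
        used_zh.add(j)
        anchors.append((i, j))
    return sorted(anchors)


def drop_crossing(anchors: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep the longest subsequence of anchors whose Chinese indices increase
    with the English indices, i.e. anchors that respect sentence order."""
    pairs = sorted(anchors)
    n = len(pairs)
    if n == 0:
        return []
    best = [1] * n    # length of the best chain ending at each anchor
    prev = [-1] * n   # back-pointer to the previous anchor in that chain
    for k in range(n):
        for m in range(k):
            if pairs[m][1] < pairs[k][1] and best[m] + 1 > best[k]:
                best[k] = best[m] + 1
                prev[k] = m
    k = max(range(n), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(pairs[k])
        k = prev[k]
    return chain[::-1]
```

Calling `drop_crossing(temporary_anchors(en_sents, zh_sents, lexicon))` yields an order-consistent anchor sequence. The greedy selection is chosen here for brevity; an exact maximum-weight assignment could be substituted without changing the rest of the sketch.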

Key words

sentence alignment / bilingual dictionary / parallel text / bipartite graph / vertex pairing / vertex pair

Cite this article

YAN Canxun. Sentence Alignment Under A Bipartite Graph Vertex Pairing Model. Journal of Chinese Information Processing. 2016, 30(5): 153-159

References

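[1] Sun Le, Jin Youbing, Du Lin, et al. Automatic Extraction of Bilingual Term Lexicon from Parallel Corpora[J]. Journal of Chinese Information Processing, 2000, 14(6): 33-39. (in Chinese)
[2] Li Li, Liu Zhiyuan, Sun Maosong. Automatic Extraction of Phrasal Paraphrases from Chinese-English Parallel Patent Corpora[J]. Journal of Chinese Information Processing, 2013, 27(6): 151-157. (in Chinese)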
[3] Ma Xiaoyi. Champollion: A Robust Parallel Text Sentence Aligner[C]//Proceedings of LREC 2006: Fifth International Conference on Language Resources and Evaluation, 2006: 489-492.
[4] Brown P F, Lai J C, Mercer R L. Aligning Sentences in Parallel Corpora[C]//Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California, 1991: 169-176.
[5] Gale W A, Church K W. A Program for Aligning Sentences in Bilingual Corpora[C]//Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California, 1991: 177-184.
[6] Kay M, Röscheisen M. Text-Translation Alignment[J]. Computational Linguistics, 1993, 19(1): 121-142.
[7] Chen S F. Aligning Sentences in Bilingual Corpora Using Lexical Information[C]//Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL '93), Columbus, Ohio, USA, 1993: 9-16.
[8] Moore R C. Fast and Accurate Sentence Alignment of Bilingual Corpora[C]//Proceedings of Machine Translation: From Research to Real Users, Springer, 2002: 135-144.
[9] Wu Dekai. Aligning a Parallel English-Chinese Corpus Statistically with Lexical Criteria[C]//Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994: 80-87.
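[10] Tan C L, Nagao M. Automatic Alignment of Japanese-Chinese Bilingual Texts[J]. IEICE Transactions on Information and Systems, 1995, E78-D(1): 68-76.
[11] Zhang Yan, Kashioka Hideki. Chinese-English Sentence Alignment Based on an Extended Length-Based Method[J]. Journal of Chinese Information Processing, 2005, 19(5): 31-36. (in Chinese)
[12] Zhang Yajun, He Chenchen, Xiang Liyun. Sentence-Level Alignment of Chinese-Uyghur Text in a Restricted Domain[J]. Software, 2014, 35(3): 62-64. (in Chinese)
[13] Shao Jian, Zhang Chengzhi. Automatically Acquiring Domain Parallel Corpora from the Internet[J]. New Technology of Library and Information Service, 2014, 253(12): 36-42. (in Chinese)
[14] Liu Ying, Wang Nan. Sentence Alignment Between Ancient Chinese and Modern Chinese[J]. Computer Applications and Software, 2013, 30(11): 127-130. (in Chinese)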
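[15] Braune F, Fraser A. Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora[C]//Proceedings of COLING 2010: Poster Volume, Beijing, 2010: 81-89.
[16] Wei Xueli. Discrete Mathematics and Its Applications[M]. Beijing: China Machine Press, 2008, 4. (in Chinese)
[17] Li Weigang, Liu Ting, Wang Zhen, Li Sheng. Paragraph Reorganization and Alignment for Bilingual Corpora[C]//Collected Papers of the Information Retrieval Laboratory, Harbin Institute of Technology, 2003: 67-73. (in Chinese)
[18] Chen Xiang, Lin Hongfei. Anchor-Information-Based Sentence Alignment of Bilingual Abstracts in Biomedical Literature[J]. Journal of Chinese Information Processing, 2009, 23(1): 58-62. (in Chinese)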
[19] Li Peng, Sun Maosong, Xue Ping. Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm[C]//Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, China, 2010: 710-718.
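[20] Xiong Wenxin. Sentence Alignment and Realignment of English-Chinese Parallel Corpora in the Environmental Protection Domain[J]. New Technology of Library and Information Service, 2013(6): 36-41. (in Chinese)
[21] Liang Maocheng, Xu Jiajin. Adding Metadata and Two-Level Alignment of Paragraphs and Sentences in Bilingual Corpus Construction[J]. Foreign Languages in China, 2012, 9(6): 37-42. (in Chinese)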

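Funding

Scientific Research Project of the Collaborative Innovation Center for the International Translation and Dissemination of Central Documents (2013XT08)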