Words Alignment in Parallel Corpus Based on Translation Probability

YANG Feiyang, ZHAO Yahui, CUI Rongyi, YI Zhiwei

Journal of Chinese Information Processing ›› 2019, Vol. 33 ›› Issue (12): 37-44.
Machine Translation

Abstract

To achieve multilingual word alignment, this paper proposes translation probability, derived from pointwise mutual information (PMI), as an improved measure of the association strength between words in different languages. First, it is shown that for words in the ordinary-frequency region that obeys Zipf's law, the PMI measure of association strength between words can be simplified to translation probability. Second, Chinese, English and Korean parallel corpora are preprocessed by sentence alignment, word segmentation and stop-word removal, after which the translation probability between words in the parallel corpus is calculated. The top k words with the highest translation probability are taken as candidate translations, and further optimization improves word alignment accuracy. Experimental results show that the method does not depend entirely on corpus size and achieves an accuracy above 94% on a small-scale corpus, providing a technical basis for word alignment in cross-language niche literature and low-resource languages.
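The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the estimator, so the sketch assumes sentence-level co-occurrence counts, estimates the translation probability as P(t|s) = count(s, t) / count(s), and omits the unspecified optimization step. The function name `translation_candidates` and the toy corpus are hypothetical. PMI is computed alongside for comparison, using the identity PMI(s, t) = log(P(t|s) / P(t)); one plausible reading of the abstract's claim is that when P(t) varies little among the candidates, ranking by PMI and ranking by P(t|s) agree.

```python
import math
from collections import Counter
from itertools import product


def translation_candidates(pairs, k=3):
    """Rank target words for each source word by estimated P(t|s).

    pairs: list of (source_tokens, target_tokens) for aligned sentences,
    assumed to be already segmented and stripped of stop words.
    Returns {source_word: [(target_word, prob, pmi), ...]} with at most k entries.
    """
    src_count = Counter()  # number of sentence pairs containing source word s
    tgt_count = Counter()  # number of sentence pairs containing target word t
    co_count = Counter()   # number of sentence pairs containing both s and t

    for src_tokens, tgt_tokens in pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        co_count.update(product(src_set, tgt_set))

    n = len(pairs)
    result = {}
    for s in src_count:
        scored = []
        for t in tgt_count:
            c = co_count[(s, t)]
            if c == 0:
                continue
            prob = c / src_count[s]  # assumed estimator of P(t|s)
            pmi = math.log(c * n / (src_count[s] * tgt_count[t]))
            scored.append((t, prob, pmi))
        # Highest probability first; ties broken by the target word itself.
        scored.sort(key=lambda item: (-item[1], item[0]))
        result[s] = scored[:k]
    return result


# Toy English-Chinese corpus (hypothetical data, for illustration only).
pairs = [
    (["cat", "sleeps"], ["猫", "睡觉"]),
    (["cat", "eats"], ["猫", "吃"]),
    (["dog", "eats"], ["狗", "吃"]),
]
candidates = translation_candidates(pairs, k=1)
```

On a corpus this small, words that occur only once tie across several candidates, which is presumably the kind of case the paper's additional optimization is meant to handle.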

Key words

word alignment / parallel corpus / translation probability / Zipf's law

Cite this article

YANG Feiyang, ZHAO Yahui, CUI Rongyi, YI Zhiwei. Words Alignment in Parallel Corpus Based on Translation Probability. Journal of Chinese Information Processing. 2019, 33(12): 37-44

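Funding

State Language Commission "13th Five-Year Plan" Scientific Research Project (YB135-76); Yanbian University World-Class Discipline Construction Research Project in Foreign Languages and Literature (18YLPY13, 18YLPY14)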