Application of Transformer-CRF Word Segmentation Method in Mongolian-Chinese Machine Translation

SU Yila, ZHANG Zhen, RENQING Dao'erji, NIU Xianghua, GAO Fen, ZHAO Yaping

Journal of Chinese Information Processing, 2019, Vol. 33, Issue (10): 38-46.
Machine Translation

Abstract

Machine translation based on the encoder-decoder (end-to-end) architecture has gradually become the mainstream approach to machine translation in natural language processing. It yields relatively high translation quality and fluency, but still suffers from a limited vocabulary and severe loss of contextual semantic information. Focusing on Mongolian-Chinese machine translation, this paper first preprocesses the corpus, proposing a Transformer-CRF algorithm for Mongolian morpheme segmentation and Chinese word segmentation. An encoder-decoder model based on Tensor2Tensor is then constructed. To learn more grammatical and semantic knowledge from the Mongolian corpus, the paper uses word vectors built on a morpheme quadruple encoding as the encoder input. To further alleviate the vocabulary limitation that arises in neural network training, a proper-noun dictionary is introduced into the translation model to improve translation quality and faithfulness to the source. Experiments comparing sentences of different lengths indicate that the model's translation performance improves when handling long-range dependencies.
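
The abstract does not specify the tag scheme or the scoring used by the Transformer-CRF segmenter, so the following is only a minimal sketch of how a CRF-style decoding layer can turn per-character scores into a segmentation. It assumes a BMES tag scheme (Begin, Middle, End, Single), hand-written emission scores standing in for the output of a Transformer encoder, and transition scores that default to zero. The names `viterbi_bmes`, `segment`, and `ALLOWED` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional, Sequence

TAGS = ("B", "M", "E", "S")  # assumed tag scheme: Begin, Middle, End, Single
NEG_INF = float("-inf")

# Hard constraints of the BMES scheme: which tags may follow a given tag.
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}


def viterbi_bmes(
    emissions: Sequence[Dict[str, float]],
    transitions: Optional[Dict[str, Dict[str, float]]] = None,
) -> List[str]:
    """Return the highest-scoring valid BMES tag sequence.

    emissions[i][tag] is the score of `tag` at position i (in a
    Transformer-CRF model these would come from the encoder; here they
    are plain numbers). transitions[prev][tag] is an optional learned
    transition score; missing entries count as 0.0.
    """
    if not emissions:
        return []

    def trans(prev: str, tag: str) -> float:
        if transitions is None:
            return 0.0
        return transitions.get(prev, {}).get(tag, 0.0)

    # A sequence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    backpointers: List[Dict[str, str]] = []

    for i in range(1, len(emissions)):
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for tag in TAGS:
            candidates = [p for p in TAGS if tag in ALLOWED[p]]
            best_prev = max(candidates, key=lambda p: score[p] + trans(p, tag))
            new_score[tag] = score[best_prev] + trans(best_prev, tag) + emissions[i][tag]
            pointers[tag] = best_prev
        score = new_score
        backpointers.append(pointers)

    # A sequence can only end with E or S.
    last = max(("E", "S"), key=lambda t: score[t])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def segment(chars: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words
```

The point of decoding jointly rather than picking the best tag per character is that per-position choices can produce sequences the scheme forbids, such as B followed directly by B; the transition constraints rule these out.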

Key words

Mongolian-Chinese translation / Transformer-CRF word segmentation algorithm / Mongolian morpheme

Cite this article

SU Yila, ZHANG Zhen, RENQING Dao'erji, NIU Xianghua, GAO Fen, ZHAO Yaping. Application of Transformer-CRF Word Segmentation Method in Mongolian-Chinese Machine Translation. Journal of Chinese Information Processing. 2019, 33(10): 38-46


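Funding

National Natural Science Foundation of China (61363052, 61966028); Natural Science Foundation of Inner Mongolia Autonomous Region (2016MS0605); Fund of the Ethnic Affairs Commission of Inner Mongolia Autonomous Region (MW-2017-MGYWXXH-03)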