Dictionary Injection for Tibetan-Chinese Machine Translation Pretraining

SANGJIE Duanzhu1,2, CARING Jia1,2

Journal of Chinese Information Processing, 2023, Vol. 37, Issue 8: 43-51
Minority, Cross-Border and Neighboring Language Information Processing


Abstract

In recent years, pretraining methods have attracted extensive attention in natural language processing. In low-resource settings such as Tibetan-Chinese machine translation, however, bilingual supervision cannot participate directly in pretraining, which limits the improvements pretrained models can deliver on such tasks. Considering that bilingual dictionaries are a rich and inexpensive source of prior translation knowledge, and inspired by the observation that speakers in cross-lingual communication often mix languages to communicate more efficiently, this paper proposes a dictionary-injection pretraining method for Tibetan-Chinese machine translation, giving the pretrained model broad opportunities to learn bilingual associations. Experiments show that the proposed method outperforms a strong BART baseline by 2.3 and 2.1 BLEU on the Tibetan-to-Chinese and Chinese-to-Tibetan test sets, respectively, confirming its effectiveness for Tibetan-Chinese machine translation.
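The abstract describes dictionary injection only at a high level; the exact injection procedure is not given here. As an illustration only, the following minimal Python sketch shows one plausible word-level scheme for building code-mixed pretraining inputs from a Tibetan-Chinese dictionary. The toy lexicon, the substitution ratio, and the replacement strategy are all assumptions for illustration, not the authors' implementation.

```python
import random

# Toy Tibetan -> Chinese lexicon (illustrative entries only); a real setup
# would load a full bilingual dictionary instead.
LEXICON = {
    "ཁ་བ": "雪",     # snow
    "རྟ": "马",       # horse
    "དཔེ་ཆ": "书",    # book
}

def inject_dictionary(tokens, lexicon, ratio=0.15, seed=None):
    """Replace a random subset of tokens with their dictionary translations,
    yielding code-mixed text from which a denoising pretraining objective
    (e.g. a BART-style model) can learn bilingual associations.

    The injection ratio, word-level granularity, and sampling strategy here
    are assumptions for illustration, not the paper's exact settings.
    """
    rng = random.Random(seed)
    mixed = []
    for tok in tokens:
        if tok in lexicon and rng.random() < ratio:
            mixed.append(lexicon[tok])   # inject the Chinese translation
        else:
            mixed.append(tok)            # keep the original Tibetan token
    return mixed

if __name__ == "__main__":
    # Toy sentence: "Snow is falling" in Tibetan, tokenized by word.
    sentence = ["ཁ་བ", "འབབ", "ཀྱི", "འདུག"]
    print(" ".join(inject_dictionary(sentence, LEXICON, ratio=0.5, seed=1)))
```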

Key words

Tibetan-Chinese / machine translation / pretraining / dictionary injection

Cite this article

SANGJIE Duanzhu, CARING Jia. Dictionary Injection for Tibetan-Chinese Machine Translation Pretraining. Journal of Chinese Information Processing, 2023, 37(8): 43-51.


Funding

Key Research and Development and Transformation Program of Qinghai Province (2022-GX-104); Qinghai Province Central Government Guided Local Science and Technology Development Fund (2022ZY006)