End-to-end Mongolian-Chinese Speech Translation Based on Multi-level Pre-training Strategies and Multi-task Learning

WANG Ningning, BAO Feilong, ZHANG Hui

PDF(2457 KB)
Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (10): 71-79.
Ethnic, Cross-border and Neighboring Language Information Processing


Abstract

End-to-end speech translation translates source-language speech directly into target-language text, and requires "source-language speech, target-language text" pairs as training data; such data are extremely scarce. This paper proposes a training method that combines a multi-level pre-training strategy with multi-task learning. First, each module of the speech recognition and machine translation models is pre-trained at multiple levels; the two models are then connected to form a speech translation model. Next, transfer learning is used to fine-tune the pre-trained model in multiple steps. During this process, multi-task learning treats speech recognition as an auxiliary task of speech translation, making full use of the various forms of data already available to train the end-to-end model. This work is the first to apply end-to-end technology to Mongolian-Chinese speech translation under resource-constrained conditions, and builds the first practically usable end-to-end Mongolian-Chinese speech translation system with relatively high translation quality.
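The training objective described in the abstract, with speech translation (ST) as the primary task and speech recognition (ASR) as an auxiliary task, can be sketched as a weighted sum of two losses. The NumPy sketch below is purely illustrative and not the authors' implementation; the `asr_weight` coefficient, the toy vocabulary size, and the token counts are all assumptions.

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy for a (T, V) matrix of logits."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def multitask_loss(st_logits, st_targets, asr_logits, asr_targets, asr_weight=0.3):
    """L = L_ST + lambda * L_ASR: translation primary, recognition auxiliary."""
    return (cross_entropy(st_logits, st_targets)
            + asr_weight * cross_entropy(asr_logits, asr_targets))

rng = np.random.default_rng(0)
st_logits = rng.normal(size=(5, 8))    # 5 translation tokens, vocabulary of 8
asr_logits = rng.normal(size=(6, 8))   # 6 transcript tokens from the same speech
st_targets = np.array([1, 2, 3, 4, 5])
asr_targets = np.array([0, 1, 2, 3, 4, 5])

loss = multitask_loss(st_logits, st_targets, asr_logits, asr_targets)
print(float(loss) > 0)  # cross-entropy terms are strictly positive
```

Setting `asr_weight` to 0 recovers pure speech-translation training, so in a multi-step fine-tuning schedule the auxiliary recognition task can be annealed away once it has served its purpose.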


Key words

Mongolian / end-to-end speech translation / pre-training / multi-task learning

Cite this Article

WANG Ningning, BAO Feilong, ZHANG Hui. End-to-end Mongolian-Chinese Speech Translation Based on Multi-level Pre-training Strategies and Multi-task Learning. Journal of Chinese Information Processing. 2024, 38(10): 71-79


Funding

National Natural Science Foundation of China (62066033); Inner Mongolia Natural Science Foundation for Distinguished Young Scholars (2022JQ05); Science and Technology Program of Inner Mongolia Autonomous Region (2021GG0158); Hohhot Collaborative Innovation Project for Universities and Research Institutes; Young Scientific and Technological Talents Cultivation Project of Inner Mongolia University (21221505); Funds for Supporting the Reform and Development of Local Universities (Discipline Construction); Special Research Project for First-class Disciplines of Inner Mongolia Autonomous Region (YLXKZX-ND-036)