Abstract
Automatic Speech Translation (AST) converts speech in a source language into text in a target language. End-to-end speech translation has become the mainstream approach to AST, but it suffers from data scarcity. This paper first constructs a 20-hour Uyghur-Chinese AST dataset using machine translation followed by manual verification. Second, to improve the performance of the end-to-end model, the model is pre-trained on a comparatively abundant target-language speech recognition corpus, which both avoids the convergence failure caused by data scarcity and lets the model acquire linguistic knowledge of the target language. Third, a mapping module is inserted before the pre-trained decoder so that it learns the mapping from source-language to target-language knowledge, yielding the end-to-end speech translation model. Finally, joint CTC/attention decoding is used to enforce alignment between speech and labels, further improving translation quality. Experimental results show that the method achieves a BLEU score of 61.45 on the Uyghur-Chinese speech translation dataset.
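For reference, the joint CTC/attention decoding mentioned above is conventionally expressed as a log-linear interpolation of the two model scores (the hybrid CTC/attention formulation); a minimal sketch in LaTeX, where $X$ is the input speech, $Y$ a candidate token sequence, and $\lambda$ an interpolation weight assumed here for illustration rather than reported by the paper:

% Joint CTC/attention scoring: interpolate the CTC and attention-decoder
% log-probabilities; \lambda in [0,1] balances the two branches.
\begin{equation}
  \hat{Y} = \operatorname*{arg\,max}_{Y}\;
    \lambda \log P_{\mathrm{CTC}}(Y \mid X)
    + (1 - \lambda)\, \log P_{\mathrm{attn}}(Y \mid X)
\end{equation}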
Keywords: speech translation / end-to-end / dataset construction
Funding
National Social Science Fund of China (17BGL199); Graduate Excellent Demonstration Course Program of Minzu University of China (GRSCP202316)