A Vietnamese-English End-to-end Speech Translation Method Based on Multi-feature Fusion

MA Houli, DONG Ling, WANG Jian, WANG Wenjun, GAO Shengxiang, YU Zhengtao

Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (10): 35-45.
Machine Translation

Abstract

The encoder of an end-to-end speech translation model must capture both the acoustic and the semantic information in speech, and a single representation, whether hand-crafted Fbank features or self-supervised Wav2vec2 features, has limited expressive power. By analyzing the differences between the hand-crafted Fbank features and the self-supervised Wav2vec2 features, this paper proposes an acoustic feature fusion method based on a cross-attention mechanism, and explores different self-supervised features and fusion strategies to strengthen the model's learning of both the acoustic and the semantic information in speech. Considering the characteristics of Vietnamese speech, Fbank features are hybrid-encoded as the primary representation with auxiliary Pitch features, yielding a multi-feature-fusion Vietnamese-English speech translation model. Experiments show that the multi-feature model outperforms single-feature baselines and is more effective than simple feature concatenation, improving Vietnamese-English speech translation by 1.97 BLEU.
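The cross-attention fusion of Fbank and Wav2vec2 features described in the abstract can be sketched as follows. This is a minimal, numpy-only, single-head illustration under stated assumptions, not the paper's actual architecture: the projection matrices (`Wq`, `Wk`, `Wv`), the residual connection, the feature dimensions, and the choice of Fbank frames as queries over Wav2vec2 frames as keys/values are all hypothetical stand-ins for the learned components of the model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(fbank, w2v, rng=None):
    """Fuse Fbank frames (queries) with Wav2vec2 frames (keys/values)
    via single-head scaled dot-product cross-attention.

    fbank: (T, d_f) hand-crafted features; w2v: (S, d_w) self-supervised
    features. Returns (T, d_f) fused features aligned with the Fbank frames.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    T, d_f = fbank.shape
    S, d_w = w2v.shape
    # Hypothetical random projections; in a trained model these are learned.
    Wq = rng.standard_normal((d_f, d_f)) / np.sqrt(d_f)
    Wk = rng.standard_normal((d_w, d_f)) / np.sqrt(d_w)
    Wv = rng.standard_normal((d_w, d_f)) / np.sqrt(d_w)
    Q, K, V = fbank @ Wq, w2v @ Wk, w2v @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_f))   # (T, S) attention weights
    return Q + attn @ V                      # residual add of attended values

# Toy shapes: 50 Fbank frames (80-dim) attend over 25 Wav2vec2 frames (768-dim).
fbank = np.random.default_rng(1).standard_normal((50, 80))
w2v = np.random.default_rng(2).standard_normal((25, 768))
fused = cross_attention_fuse(fbank, w2v)
print(fused.shape)  # (50, 80)
```

Keeping the Fbank stream as the query side preserves its frame rate and lets the attention pull in complementary semantic context from the self-supervised stream, which is one plausible reading of why the paper reports fusion beating simple concatenation.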

Key words

speech translation / Vietnamese / feature fusion

Cite this article

MA Houli, DONG Ling, WANG Jian, WANG Wenjun, GAO Shengxiang, YU Zhengtao. A Vietnamese-English End-to-end Speech Translation Method Based on Multi-feature Fusion. Journal of Chinese Information Processing, 2024, 38(10): 35-45.

References

[1] STENTIFORD F W M, STEER M G. Machine translation of speech[J]. British Telecom Technology Journal, 1988, 6(2): 116-122.
[2] DI GANGI M A, CATTONI R, BENTIVOGLI L, et al. MuST-C: A multilingual speech translation corpus[C]//Proceedings of NAACL: Human Language Technologies. Minneapolis, Minnesota: Association for Computational Linguistics, 2019: 2012-2017.
[3] BÉRARD A, PIETQUIN O, BESACIER L, et al. Listen and translate: A proof of concept for end-to-end speech-to-text translation[C]//Proceedings of the NIPS Workshop on End-to-End Learning for Speech and Audio Processing. Barcelona, Spain, 2016.
[4] HAN C, WANG M X, JI H, et al. Learning shared semantic space for speech-to-text translation[C]//Proceedings of the Association for Computational Linguistics, 2021: 2214-2225.
[5] BERREBBI D, SHI J, YAN B, et al. Combining spectral and self-supervised features for low resource speech recognition and translation[C]//Proceedings of the Interspeech. Incheon, Korea, 2022: 3533-3537.
[6] NGUYEN V H. An end-to-end model for Vietnamese speech recognition[C]//Proceedings of the IEEE-RIVF International Conference on Computing and Communication Technologies. Danang, Vietnam: IEEE, 2019: 1-6.
[7] DUONG L, ANASTASOPOULOS A, CHIANG D, et al. An attentional model for speech translation without transcription[C]//Proceedings of NAACL: Human Language Technologies. San Diego, California: Association for Computational Linguistics, 2016: 949-959.
[8] PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: A simple data augmentation method for automatic speech recognition[C]//Proceedings of the Interspeech. Graz, Austria, 2019: 2613-2617.
[9] ANASTASOPOULOS A, CHIANG D. Tied multitask learning for neural speech translation[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, Louisiana, 2018: 82-91.
[10] GAIDO M, DI GANGI M A, NEGRI M, et al. End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020[C]//Proceedings of the 17th International Conference on Spoken Language Translation. Online: Association for Computational Linguistics, 2020: 80-88.
[11] MOHAMED A R. Deep neural network acoustic models for ASR[D]. Canada: University of Toronto, 2014.
[12] BAEVSKI A, ZHOU Y, MOHAMED A, et al. Wav2vec 2.0: A framework for self-supervised learning of speech representations[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Online: Curran Associates, Inc., 2020: 12449-12460.
[13] VU X S, VU T, TRAN M V, et al. HSD shared task in VLSP campaign: Hate speech detection for social good[J]. arXiv preprint arXiv:2007.06493, 2020.
[14] SENNRICH R, HADDOW B, BIRCH A. Neural machine translation of rare words with subword units[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, 2016: 1715-1725.
[15] WANG C, TANG Y, MA X, et al. Fairseq S2T: Fast speech-to-text modeling with fairseq[C]//Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations. Suzhou, China, 2020: 33-39.
[16] BANSAL S, KAMPER H, LIVESCU K, et al. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota, 2019: 58-68.
[17] MÜLLER R, KORNBLITH S, HINTON G E. When does label smoothing help?[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, BC, Canada, 2019: 4694-4703.
[18] KINGMA D, BA J. Adam: A method for stochastic optimization[C]//Proceedings of the International Conference on Learning Representations. San Diego, CA, USA, 2015.
[19] NGUYEN H, BOUGARES F, TOMASHENKO N, et al. Investigating self-supervised pre-training for end-to-end speech translation[C]//Proceedings of the Interspeech. Shanghai, China, 2020: 1466-1470.

Funding

National Natural Science Foundation of China (61732005, U21B2027, 61972186); Yunnan High-tech Industry Development Project (201606); Yunnan Provincial Major Science and Technology Special Plan (202103AA080015, 202002AD080001-5); Yunnan Fundamental Research Projects (202001AS070014); Yunnan Provincial Reserve Talents for Academic and Technical Leaders (202105AC160018)