Abstract
The encoder of a speech translation model must capture both the acoustic and the semantic information in speech, which a single Fbank or Wav2vec2 representation cannot fully provide. By analyzing the differences between hand-crafted Fbank features and self-supervised Wav2vec2 features, this paper proposes an acoustic feature fusion method based on a cross-attention mechanism, and explores different self-supervised features and fusion strategies to strengthen the model's learning of acoustic and semantic information. Considering the characteristics of Vietnamese speech, a hybrid Fbank representation is built with Fbank features as the primary input and Pitch features as an auxiliary, yielding a multi-feature-fusion Vietnamese-English speech translation model. Experiments show that the multi-feature model outperforms single-feature baselines and is more effective than simple feature concatenation; the proposed fusion method improves Vietnamese-English speech translation by 1.97 BLEU.
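The cross-attention fusion described above can be sketched as follows: frames of the spectral (Fbank) stream act as queries and attend over frames of the self-supervised (Wav2vec2) stream, and the attended context is added back to a projection of the Fbank stream. This is a minimal illustrative sketch, not the paper's implementation; the projection matrices, dimensions, and the residual-add combination are assumptions standing in for learned model weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(fbank, w2v, d_model=64, seed=0):
    """Fuse two feature sequences: Fbank frames attend to Wav2vec2 frames.

    fbank: (T1, d_fbank) spectral features; w2v: (T2, d_w2v) self-supervised
    features. Returns a fused sequence of shape (T1, d_model). The random
    projection matrices are stand-ins for weights learned during training.
    """
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((fbank.shape[1], d_model)) / np.sqrt(fbank.shape[1])
    Wk = rng.standard_normal((w2v.shape[1], d_model)) / np.sqrt(w2v.shape[1])
    Wv = rng.standard_normal((w2v.shape[1], d_model)) / np.sqrt(w2v.shape[1])
    Wf = rng.standard_normal((fbank.shape[1], d_model)) / np.sqrt(fbank.shape[1])

    q = fbank @ Wq              # queries come from the Fbank stream
    k, v = w2v @ Wk, w2v @ Wv   # keys/values come from the Wav2vec2 stream
    attn = softmax(q @ k.T / np.sqrt(d_model))  # (T1, T2) soft alignment
    return fbank @ Wf + attn @ v  # residual add of the attended context

# Toy inputs: 50 frames of 80-dim Fbank, 25 frames of 768-dim Wav2vec2 output
fbank = np.random.default_rng(1).standard_normal((50, 80))
w2v = np.random.default_rng(2).standard_normal((25, 768))
fused = cross_attention_fuse(fbank, w2v)
print(fused.shape)  # (50, 64)
```

Note that the two streams may have different frame rates (Fbank is typically computed every 10 ms, while Wav2vec2 outputs a frame every 20 ms); cross-attention handles this naturally, since the query and key sequences need not have equal length, which is one advantage over plain frame-wise concatenation.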
Key words
speech translation /
Vietnamese /
feature fusion
Funding
National Natural Science Foundation of China (61732005, U21B2027, 61972186); Yunnan High-Tech Industry Development Project (201606); Yunnan Provincial Major Science and Technology Special Program (202103AA080015, 202002AD080001-5); Yunnan Fundamental Research Program (202001AS070014); Yunnan Provincial Reserve Talents Program for Academic and Technical Leaders (202105AC160018)