Journal of Chinese Information Processing ›› 2022, Vol. 36 ›› Issue (7): 86-97.
Information Processing of Ethnic, Cross-border and Neighboring Languages

MonTTS: A Real-time and High-fidelity Mongolian TTS Model with Pure Non-autoregressive Mechanism

LIU Rui¹, KANG Shiyin², GAO Guanglai¹, LI Jingdong³, BAO Feilong¹

Abstract

Existing Tacotron-based Mongolian speech synthesis systems suffer from two problems: (1) low synthesis efficiency and (2) low fidelity of the synthesized speech. To address them, this paper proposes MonTTS, a fully non-autoregressive, real-time, high-fidelity Mongolian text-to-speech (TTS) model built on FastSpeech2. To improve the prosodic naturalness and fidelity of the synthesized Mongolian speech, three improvements are introduced according to the acoustic characteristics of Mongolian: (1) Mongolian phoneme sequences are used to represent Mongolian pronunciation; (2) a phoneme-level variance adaptor is proposed to learn long-term prosodic variation; and (3) two duration alignment methods, based on Mongolian speech recognition and on autoregressive Mongolian TTS respectively, are proposed to provide the duration supervision signal. In addition, this paper builds MonSpeech, currently the largest Mongolian speech synthesis corpus. Experimental results show that MonTTS achieves a Mean Opinion Score (MOS) of 4.53 for prosodic naturalness, significantly outperforming both the state-of-the-art Tacotron-based Mongolian TTS baseline and the standard FastSpeech2 baseline; its real-time factor (RTF) of 3.63×10⁻³ meets the requirement for real-time, high-fidelity synthesis. All training scripts and pre-trained models are open-sourced at https://github.com/ttslr/MonTTS.
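To make the second improvement above concrete, the sketch below illustrates what "phoneme-level" prosody modelling means in a FastSpeech2-style variance adaptor: pitch and energy are predicted once per phoneme (rather than once per spectrogram frame), and the prosody-enriched phoneme representations are then expanded to frame level by the predicted durations. The class name, layer choices, and sizes are assumptions made for illustration, not the paper's implementation.

```python
# Minimal, illustrative sketch of a phoneme-level variance adaptor in the
# FastSpeech2 style. NOT the authors' code: layer choices and sizes are
# assumptions; real systems typically use small convolutional predictors.
import torch
import torch.nn as nn

class PhonemeLevelVarianceAdaptor(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.duration_predictor = nn.Linear(hidden, 1)  # predicts log-duration per phoneme
        self.pitch_predictor = nn.Linear(hidden, 1)     # one pitch value per phoneme
        self.energy_predictor = nn.Linear(hidden, 1)    # one energy value per phoneme
        self.pitch_embed = nn.Linear(1, hidden)
        self.energy_embed = nn.Linear(1, hidden)

    def forward(self, h: torch.Tensor):
        # h: [batch, n_phonemes, hidden] phoneme encodings from the text encoder.
        durations = self.duration_predictor(h).squeeze(-1).exp().round().long()
        pitch = self.pitch_predictor(h)
        energy = self.energy_predictor(h)
        h = h + self.pitch_embed(pitch) + self.energy_embed(energy)
        # Length regulator: repeat each prosody-enriched phoneme vector by its
        # predicted duration to obtain frame-level decoder inputs.
        return [torch.repeat_interleave(h[b], durations[b], dim=0)
                for b in range(h.size(0))]
```

The real-time factor (RTF) quoted above is the ratio of wall-clock synthesis time to the duration of the generated audio, so an RTF of 3.63×10⁻³ corresponds to roughly 3.63 ms of computation per second of speech. A trivial helper (hypothetical function, illustrative numbers):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the audio produced."""
    return synthesis_seconds / audio_seconds

# With the figure reported above, generating 10 s of speech would take about
# 10 * 3.63e-3 ≈ 0.036 s of computation.
print(real_time_factor(0.0363, 10.0))  # -> 0.00363
```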

Keywords

Mongolian text-to-speech (TTS) / non-autoregressive acoustic model / non-autoregressive neural vocoder / real-time / high-fidelity

Cite this article

LIU Rui, KANG Shiyin, GAO Guanglai, LI Jingdong, BAO Feilong. MonTTS: A Real-time and High-fidelity Mongolian TTS Model with Pure Non-autoregressive Mechanism. Journal of Chinese Information Processing, 2022, 36(7): 86-97.

Funding

"Junma Plan" High-level Talent Introduction Project of Inner Mongolia University (100000-22311201/002); National Key Research and Development Program of China (2018YFE0122900); National Natural Science Foundation of China (61773224, 62066033); Natural Science Foundation of Inner Mongolia (2018MS06006); Achievements Transformation Project of Inner Mongolia Autonomous Region (CGZH2018125); Applied Technology Research and Development Program of Inner Mongolia Autonomous Region (2019GG372, 2020GG0046)