Journal of Chinese Information Processing ›› 2022, Vol. 36 ›› Issue (6): 52-60.
Ethnic, Cross-border and Neighboring Language Information Processing


CNN-CTC Based Layer Transfer Model for Mongolian Speech Recognition

  • LYU Haotian1, MA Zhiqiang1,2, WANG Hongbin1, XIE Xiulan1

Abstract

The scarcity of Mongolian speech corpora means that low-resource data cannot adequately train deep network models for Mongolian speech recognition. To address this, this paper proposes a layer transfer method based on transfer learning and designs several transfer strategies to build a CNN-CTC (Convolutional Neural Network and Connectionist Temporal Classification) based layer transfer model for Mongolian speech recognition; the strategies are then compared to obtain the optimal model. On an English corpus of 10,000 sentences and a Mongolian corpus of 5,000 sentences, experiments were conducted on learning-rate selection for layer transfer training, the effectiveness of layer transfer, transfer-layer selection strategies, and the impact of the amount of high-resource training data on the layer transfer model. The results show that the layer transfer model speeds up training and effectively reduces the word error rate (WER); a bottom-up transfer-layer selection strategy yields the best layer transfer model; and, under the limited Mongolian corpus resources, the CNN-CTC based layer transfer model achieves a WER 10.18% lower than an ordinary CNN-CTC based Mongolian speech recognition model.
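The layer transfer idea described in the abstract can be pictured as copying the lower layers of a CNN-CTC acoustic model trained on the high-resource English corpus into an otherwise identical model for the low-resource Mongolian corpus, then fine-tuning with CTC loss. Below is a minimal PyTorch sketch of the bottom-up transfer-layer selection strategy; the layer count, feature and label-set sizes, and the transfer_bottom_up helper are illustrative assumptions rather than the paper's actual configuration.

```python
# Minimal sketch of bottom-up layer transfer for a CNN-CTC acoustic model.
# Network shape, label-set sizes, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class CNNCTCModel(nn.Module):
    def __init__(self, n_feats=80, n_labels=60):
        super().__init__()
        # Stack of 1-D convolutional blocks over the time axis.
        self.conv_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(n_feats if i == 0 else 256, 256, kernel_size=5, padding=2),
                nn.BatchNorm1d(256),
                nn.ReLU())
            for i in range(4)
        ])
        self.classifier = nn.Linear(256, n_labels)  # per-frame label scores

    def forward(self, x):                # x: (batch, n_feats, time)
        for layer in self.conv_layers:
            x = layer(x)
        x = x.transpose(1, 2)            # (batch, time, 256)
        return self.classifier(x).log_softmax(dim=-1)

def transfer_bottom_up(src: CNNCTCModel, tgt: CNNCTCModel, n_layers: int,
                       freeze: bool = False):
    """Copy the lowest n_layers conv blocks from the high-resource (English)
    model into the low-resource (Mongolian) model; optionally freeze them."""
    for i in range(n_layers):
        tgt.conv_layers[i].load_state_dict(src.conv_layers[i].state_dict())
        if freeze:
            for p in tgt.conv_layers[i].parameters():
                p.requires_grad = False

# Usage: pretrain on English, transfer the bottom layers, then fine-tune on
# Mongolian with CTC loss (blank label assumed to be index 0).
english_model = CNNCTCModel(n_labels=29)    # hypothetical English label set
mongolian_model = CNNCTCModel(n_labels=60)  # hypothetical Mongolian label set
transfer_bottom_up(english_model, mongolian_model, n_layers=2)
ctc_loss = nn.CTCLoss(blank=0)
```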

Key words

speech recognition / acoustic model / low corpus resources / layer transfer

Cite this article

LYU Haotian, MA Zhiqiang, WANG Hongbin, XIE Xiulan. CNN-CTC Based Layer Transfer Model for Mongolian Speech Recognition. Journal of Chinese Information Processing. 2022, 36(6): 52-60


Funding

National Natural Science Foundation of China (61762070, 61862048); Natural Science Foundation of Inner Mongolia Autonomous Region (2019MS06004); Science and Technology Major Project of Inner Mongolia Autonomous Region (2019ZD015); Key Technology Research Program of Inner Mongolia Autonomous Region (2019GG273)