Abstract
To address the poor performance of speaker-adaptive Mongolian speech recognition caused by the scarcity of Mongolian corpora, this paper proposes a speaker feature extraction method based on I-vector feature fusion. First, separate I-vector models are trained on a low-resource corpus and a high-resource corpus. The I-vector features produced by the two models are then used as intermediate data for a final feature-fusion training stage. Experimental results on the Mongolian and TIMIT corpora show that the fused I-vector speaker features outperform the pre-fusion I-vector features, reducing the average WER by 0.7% and the average SER by 3.1%.
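The reported figures of merit are the standard word error rate (WER) and sentence error rate (SER). As conventionally defined:

\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{SER} = \frac{N_{\mathrm{err}}}{N_{\mathrm{utt}}}

where S, D, and I count word substitutions, deletions, and insertions against the reference transcript, N is the number of reference words, N_err is the number of utterances containing at least one word error, and N_utt is the total number of utterances.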
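The abstract describes the fusion step only at a high level, and the paper's exact fusion architecture is not reproduced here. The following is a minimal, hypothetical sketch in Python (PyTorch) of one plausible reading: per-utterance i-vectors from a low-resource extractor and a high-resource extractor are concatenated and mapped by a small trainable layer into a fused speaker embedding. All names and dimensions (IVectorFusion, dim_low, dim_high, dim_fused) are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of i-vector feature fusion (not the authors'
# exact architecture). Two extractors -- one trained on the
# low-resource (Mongolian) corpus, one on a high-resource corpus --
# each yield a fixed-length i-vector per utterance; a small trainable
# layer maps their concatenation to a fused speaker embedding.

import torch
import torch.nn as nn

class IVectorFusion(nn.Module):
    """Fuse two per-utterance i-vectors into one speaker embedding."""

    def __init__(self, dim_low: int = 100, dim_high: int = 100,
                 dim_fused: int = 100):
        super().__init__()
        # Assumed fusion: concatenate, then a learned affine map + tanh.
        self.proj = nn.Linear(dim_low + dim_high, dim_fused)

    def forward(self, iv_low: torch.Tensor,
                iv_high: torch.Tensor) -> torch.Tensor:
        # iv_low:  (batch, dim_low)  i-vectors from the low-resource model
        # iv_high: (batch, dim_high) i-vectors from the high-resource model
        fused = torch.cat([iv_low, iv_high], dim=-1)
        return torch.tanh(self.proj(fused))

if __name__ == "__main__":
    fusion = IVectorFusion()
    iv_low = torch.randn(8, 100)   # batch of 8 utterances
    iv_high = torch.randn(8, 100)
    speaker_emb = fusion(iv_low, iv_high)
    print(speaker_emb.shape)       # torch.Size([8, 100])

In such a reading, the fused embedding would play the same role as a conventional i-vector: appended to frame-level acoustic features as input for speaker-aware acoustic model training.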
Keywords
I-vector /
speaker adaptation /
feature extraction /
Mongolian /
low resource
Funding
National Natural Science Foundation of China (61762070, 61862048); Natural Science Foundation of the Inner Mongolia Autonomous Region (2019MS06004); Science and Technology Major Project of the Inner Mongolia Autonomous Region (2019ZD015); Key Technology Research Program of the Inner Mongolia Autonomous Region (2019GG273)