A Speaker Feature Extraction Method Based on I-vector Resource Fusion

HAN Jiajun1, MA Zhiqiang1,2, WANG Hongbin1, XIE Xiulan1

Journal of Chinese Information Processing ›› 2023, Vol. 37 ›› Issue (1): 71-78.
Minority, Cross-border and Neighboring Language Information Processing

Abstract

To address the poor performance of Mongolian speaker-adaptive speech recognition caused by the scarcity of Mongolian corpora, this paper proposes a speaker feature extraction method based on I-vector feature fusion. First, separate I-vector models are trained on a low-resource corpus and a high-resource corpus. The I-vector features extracted by the two models are then used as intermediate data for a final feature-fusion training stage. Experiments on the Mongolian and TIMIT corpora show that the fused I-vector speaker features outperform the pre-fusion I-vector features, reducing the average WER by 0.7% and the average SER by 3.1%.
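
The abstract only outlines the fusion step; purely to illustrate the idea, below is a minimal sketch in PyTorch, assuming two fixed-dimensional utterance-level i-vectors (100 dimensions each, a common default) that are concatenated and mapped through a small learned projection. The class name, dimensions, and architecture are illustrative assumptions, not the paper's actual model.

# Minimal sketch of i-vector feature fusion in PyTorch. The dimensions,
# class name, and architecture are illustrative assumptions, not the
# paper's exact setup.
import torch
import torch.nn as nn

class IVectorFusion(nn.Module):
    def __init__(self, low_dim=100, high_dim=100, fused_dim=100):
        super().__init__()
        # Learn a joint projection over the concatenated i-vectors.
        self.fuse = nn.Sequential(
            nn.Linear(low_dim + high_dim, fused_dim),
            nn.Tanh(),
        )

    def forward(self, ivec_low, ivec_high):
        # ivec_low: i-vector from the model trained on the low-resource
        # (Mongolian) corpus; ivec_high: from the high-resource corpus.
        return self.fuse(torch.cat([ivec_low, ivec_high], dim=-1))

# Usage: fuse a batch of 8 i-vector pairs into 8 speaker features.
model = IVectorFusion()
fused = model(torch.randn(8, 100), torch.randn(8, 100))
print(fused.shape)  # torch.Size([8, 100])

In a typical i-vector speaker-adaptation setup, the resulting speaker feature is appended to the per-frame acoustic features as an auxiliary input to the acoustic model. For the reported metrics: WER is the word error rate (substitutions, deletions, and insertions divided by the number of reference words), and SER is the sentence error rate, i.e., the fraction of sentences containing at least one recognition error.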

Key words

I-vector / speaker adaptation / feature extraction / Mongolian / low resource

Cite this article

HAN Jiajun, MA Zhiqiang, WANG Hongbin, XIE Xiulan. A Speaker Feature Extraction Method Based on I-vector Resource Fusion. Journal of Chinese Information Processing, 2023, 37(1): 71-78.


Funding

National Natural Science Foundation of China (61762070, 61862048); Natural Science Foundation of Inner Mongolia Autonomous Region (2019MS06004); Science and Technology Major Project of Inner Mongolia Autonomous Region (2019ZD015); Key Technology Research Plan of Inner Mongolia Autonomous Region (2019GG273)