Abstract
To improve Mongolian speech recognition, this paper combines the Time Delay Neural Network (TDNN) with the Feed-forward Sequential Memory Network (FSMN) and applies the fused model to Mongolian acoustic modeling, exploiting contextual information by modeling long sequences of speech frames. We also investigate how the lengths of the history (preceding-frame) and future (subsequent-frame) context in the FSMN memory block affect the model, and we analyze how the number of hidden layers and the number of nodes per hidden layer in the fused TDNN-FSMN affect acoustic model performance. Experimental results show that the fused TDNN-FSMN outperforms the DNN, TDNN and FSMN models, reducing the word error rate (WER) by 22.2% compared with the baseline DNN model.
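As a rough illustration of what "history" and "future" length mean in an FSMN memory block, the sketch below implements the scalar form of the memory described in the FSMN literature: each frame's memory output is a weighted sum of the current and preceding hidden activations plus a weighted sum of the subsequent ones. The function name `fsmn_memory`, the zero-padding at sequence boundaries, and the coefficient values are assumptions made for this example; they are not taken from the paper, whose experiments may use a different variant and learned coefficients.

```python
def fsmn_memory(hidden, lookback, lookahead):
    """Scalar FSMN-style memory block (illustrative sketch).

    hidden:    list of T frames, each a list of D floats (hidden activations).
    lookback:  [a_0, ..., a_N1], weights for frames t, t-1, ..., t-N1 (history).
    lookahead: [c_1, ..., c_N2], weights for frames t+1, ..., t+N2 (future).
    Frames outside the sequence are treated as zeros (an assumption).
    """
    num_frames = len(hidden)
    dim = len(hidden[0]) if num_frames else 0
    memory = []
    for t in range(num_frames):
        m = [0.0] * dim
        # History part: current frame and up to N1 preceding frames.
        for i, a in enumerate(lookback):
            if t - i >= 0:
                for d in range(dim):
                    m[d] += a * hidden[t - i][d]
        # Future part: up to N2 subsequent frames.
        for j, c in enumerate(lookahead, start=1):
            if t + j < num_frames:
                for d in range(dim):
                    m[d] += c * hidden[t + j][d]
        memory.append(m)
    return memory


# Toy input: 4 frames of 1-dimensional activations,
# history length N1 = 1 and future length N2 = 1.
frames = [[1.0], [2.0], [3.0], [4.0]]
result = fsmn_memory(frames, lookback=[1.0, 0.5], lookahead=[0.25])
print(result)
```

Lengthening `lookback` or `lookahead` widens the context each frame can see. A longer `lookahead` also means waiting for more future frames before a frame's output can be computed.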
Key words
Mongolian /
speech recognition /
Time Delay Neural Network /
Feed-forward Sequential Memory Network
Funding
National Natural Science Foundation of China (61563040, 61773224); Natural Science Foundation of Inner Mongolia (2016ZD06)