Abstract
This paper studies two methods for improving the estimation of posterior probabilities in the evaluation of freely-spoken speech under a deep neural network (DNN) acoustic modeling framework: 1) the posterior probability is re-estimated from more accurate recognition results, obtained by rescoring the N-best candidates of the first decoding pass with a recurrent neural network (RNN) language model; 2) following the multilingual neural network training framework, clustered states of dialect data are added as output nodes of the decoding network, and the resulting dialect likelihood scores are incorporated into the posterior probability estimation to measure the degree of dialectal accent. Experiments show that the posterior probabilities estimated by these two methods improve the correlation with human scores by 3.5% and 1.0% absolute, respectively, and their combination yields a 4.9% absolute improvement. On a real evaluation task, incorporating the improved posterior-probability scoring features raises the overall scoring correlation by 2.2% absolute.
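A minimal sketch of the first method, N-best rescoring with an RNN language model, is given below. It assumes a first decoding pass that produces an acoustic score and an n-gram LM score for each hypothesis; the names (Hypothesis, rnnlm_logprob, lm_weight, interp) and the dummy RNN LM scorer are illustrative assumptions, not the paper's implementation, which would query a trained RNN LM for each candidate.

from dataclasses import dataclass
from typing import List

@dataclass
class Hypothesis:
    words: List[str]
    am_score: float        # acoustic log-likelihood from the first decoding pass
    ngram_lm_score: float  # n-gram LM log-probability from the first pass

def rnnlm_logprob(words: List[str]) -> float:
    # Stand-in for a trained RNN language model scorer; a real system would
    # return log P(w_1 ... w_n) from the network. The constant below only
    # keeps the sketch runnable.
    return -2.0 * len(words)

def rescore_nbest(nbest: List[Hypothesis], lm_weight: float = 10.0,
                  interp: float = 0.5) -> List[Hypothesis]:
    # Re-rank hypotheses by the acoustic score plus an interpolation of the
    # first-pass n-gram LM score and the RNN LM score.
    def total(h: Hypothesis) -> float:
        lm = interp * rnnlm_logprob(h.words) + (1.0 - interp) * h.ngram_lm_score
        return h.am_score + lm_weight * lm
    return sorted(nbest, key=total, reverse=True)

if __name__ == "__main__":
    nbest = [
        Hypothesis(["we", "like", "speech"], am_score=-120.0, ngram_lm_score=-9.5),
        Hypothesis(["we", "light", "speech"], am_score=-118.0, ngram_lm_score=-12.3),
    ]
    best = rescore_nbest(nbest)[0]
    print("1-best after rescoring:", " ".join(best.words))

In the paper's setting, the 1-best transcription selected after rescoring replaces the first-pass result when the posterior probabilities are re-estimated, which is the source of the reported 3.5% absolute gain in correlation.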
Key words
freely spoken speech / pronunciation quality evaluation / posterior probability / deep neural network / RNN language model
Funding
National Natural Science Foundation of China (61273264)