Improved Posterior Probability Estimation Methods for the Freely-Spoken Speech Evaluation
XU Sukui1, DAI Lirong1, WEI Si2, LIU Qingfeng1,2, GAO Qianyong2
1 National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui 230027, China; 2 Anhui USTC iFlytek Co., Ltd., Hefei, Anhui 230088, China
Abstract: Two methods within the deep neural network (DNN) acoustic modeling framework are proposed to improve posterior probability estimation for pronunciation evaluation of freely-spoken speech: 1) the posterior probability is re-estimated from more accurate recognition results, obtained by using an RNN language model to re-score the N-best candidates produced by the first decoding pass; 2) the influence of dialect on the posterior probability is taken into account by incorporating likelihood scores from dialect-clustered nodes added to the DNN acoustic model, which is re-trained in a multilingual style. Experimental results show that these methods increase the correlation between posterior probabilities and human scores by 3.5% and 1.0% respectively, and that combining the two methods yields a 4.9% increase. In a real evaluation task, a 2.2% absolute improvement is observed in the correlation between machine scores and human scores.
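The first method, N-best re-scoring with an RNN language model, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hypothesis` fields, the `rnnlm_logprob` callable, the `lm_weight` scale, and the log-domain interpolation weight `interp` are all assumptions introduced here for illustration, and the abstract does not specify how the RNN and first-pass language model scores are combined.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    """One N-best candidate from the first decoding pass (hypothetical layout)."""
    words: List[str]
    acoustic_logprob: float  # acoustic model log-score from the first pass
    ngram_logprob: float     # first-pass language model log-score


def rescore_nbest(
    nbest: Sequence[Hypothesis],
    rnnlm_logprob: Callable[[List[str]], float],
    lm_weight: float = 10.0,
    interp: float = 0.5,
) -> Hypothesis:
    """Return the candidate with the best combined score.

    The language model score is a log-domain interpolation of the RNN LM
    and the first-pass LM (an assumed, simplified combination rule).
    """
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")

    def total(h: Hypothesis) -> float:
        lm = interp * rnnlm_logprob(h.words) + (1.0 - interp) * h.ngram_logprob
        return h.acoustic_logprob + lm_weight * lm

    return max(nbest, key=total)
```

In this sketch, the word sequence of the selected hypothesis would then serve as the transcript against which the posterior probability is re-estimated. Setting `interp` to 0 recovers the first-pass ranking, which makes it easy to compare results with and without the RNN language model.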