Abstract: In this paper, HMM-based trainable speech synthesis is applied to Chinese. Appropriate HMM parameters are selected and optimized, and the contextual features and the corresponding question set for tree-based HMM clustering are designed around the characteristics of Chinese, in order to improve HMM modeling and training. In the evaluation, the synthetic speech produced after these improvements obtains a preference score of 98.5%. Furthermore, to improve the rhythm of the synthetic speech, a two-level model is introduced for duration modeling and prediction, which reduces the duration prediction RMSE from 29.56 ms to 27.01 ms. The evaluation of the final system shows that the synthetic speech is stable, fluent and rhythmically natural. Because the speech synthesis system requires very little storage, it is especially suitable for embedded applications.
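The duration figures above can be put in context with a minimal sketch of how an RMSE over durations and its relative reduction are usually computed. The function names and the toy durations below are hypothetical and are not taken from the paper; the abstract also does not state at which unit level (state, phone or syllable) the error is measured, so the sketch simply assumes paired predicted and reference durations in milliseconds.

```python
import math


def duration_rmse(predicted_ms, reference_ms):
    """Root-mean-square error between predicted and reference durations (ms)."""
    if len(predicted_ms) != len(reference_ms) or not predicted_ms:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted_ms, reference_ms)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Toy durations, invented purely for illustration (not data from the paper).
toy_reference = [100.0, 150.0, 200.0]
toy_predicted = [110.0, 140.0, 230.0]
toy_rmse = duration_rmse(toy_predicted, toy_reference)

# The two RMSE values reported in the abstract, in milliseconds.
reduction = relative_reduction(29.56, 27.01)

print(f"toy RMSE: {toy_rmse:.2f} ms")
print(f"relative RMSE reduction: {reduction:.1%}")
```

Under this reading, the drop from 29.56 ms to 27.01 ms is an absolute reduction of 2.55 ms, or roughly 8.6% relative to the baseline.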