Abstract
This paper applies HMM-based trainable speech synthesis to Chinese. HMM modeling parameters are selected and optimized, and a set of contextual features and the corresponding question set for tree-based model clustering are designed around the characteristics of Chinese speech, in order to improve HMM modeling and training. In a comparative evaluation, the speech synthesized after these improvements was judged better in quality in 98.5% of cases (a preference score of 98.5%). To address the weak rhythm of the synthetic speech, a two-level duration model based on HMM states and initial/final (声韵母) units is proposed for duration modeling and prediction; it reduces the duration prediction RMSE on data outside the training set from 29.56 ms to 27.01 ms. Evaluation of the final system shows that the synthetic speech is stable and fluent overall, with a fairly strong sense of rhythm. Because the synthesis system requires very little storage, it is particularly well suited to embedded applications.
Key words
computer application /
Chinese information processing /
speech synthesis /
HMM /
trainable TTS /
duration modeling
Funding
Supported by the National Natural Science Foundation of China (60475015)