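This paper presents a preliminary investigation of the uncertainty in the prosodic organization of natural speech and of its effect on the naturalness of synthesized speech. On this basis, it proposes replacing the traditional maximum generation-probability criterion in prosody prediction with a minimum error-probability criterion, so that multiple equivalent prosodic realizations are retained in the prediction result. It further proposes an algorithm that combines prosody prediction based on the minimum error criterion with unit selection: first, the samples least likely to cause prosodic errors are selected from all candidate units according to the minimum error criterion; then, among the prosodically equivalent paths, the one that achieves the smoothest concatenation is chosen as the final output according to the smoothest-concatenation criterion.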
Abstract
This paper explores the uncertainty of prosody in a speech corpus that contains two read versions of 1000 sentences, recorded by a professional voice talent under the same linguistic and affective planning. Corresponding prosodic features in the two versions are found to vary over a rather wide range: the scope of local variation can be as large as 45-50% of a speaker's overall variation range. Based on this observation, the paper proposes a minimum error-rate criterion (MERC) to replace the traditional maximum correct-rate criterion in prosody generation. Furthermore, it proposes an approach for integrating the MERC into the unit selection algorithm. Among all instances of a speech unit, those least likely to result in unnatural prosody are picked out first; the most suitable path is then selected from all prosodically equivalent candidates under a smoothness criterion, to ensure the smoothest concatenation of all units on that path.
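The two-stage selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the MERC score is computed, so the code assumes each candidate already carries an estimated probability of sounding prosodically wrong (`error_prob`) and that filtering is a simple threshold (`max_error`). The `Candidate` fields, the F0-only `join_cost`, and all function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    unit_id: str       # hypothetical identifier of one corpus instance
    error_prob: float  # assumed estimate of P(prosody sounds wrong here)
    start_f0: float    # F0 at the left edge, used only by the toy join cost
    end_f0: float      # F0 at the right edge


def merc_filter(cands: Sequence[Candidate], max_error: float = 0.2) -> List[Candidate]:
    """Stage 1: keep the instances unlikely to cause a prosodic error."""
    kept = [c for c in cands if c.error_prob <= max_error]
    # Never leave a slot empty: fall back to the single least risky instance.
    return kept or [min(cands, key=lambda c: c.error_prob)]


def join_cost(left: Candidate, right: Candidate) -> float:
    """Toy smoothness measure (an assumption): F0 mismatch at the joint."""
    return abs(left.end_f0 - right.start_f0)


def smoothest_path(
    slots: Sequence[Sequence[Candidate]],
    cost: Callable[[Candidate, Candidate], float] = join_cost,
) -> List[Candidate]:
    """Stage 2: dynamic programming for the lowest total join cost."""
    best = [0.0] * len(slots[0])   # best cost of a path ending at each candidate
    back: List[List[int]] = []     # back-pointers, one list per transition
    for prev, cur in zip(slots, slots[1:]):
        new_best, pointers = [], []
        for c in cur:
            j = min(range(len(prev)), key=lambda k: best[k] + cost(prev[k], c))
            new_best.append(best[j] + cost(prev[j], c))
            pointers.append(j)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    i = min(range(len(best)), key=best.__getitem__)
    path = [i]
    for pointers in reversed(back):
        i = pointers[i]
        path.append(i)
    path.reverse()
    return [slot[k] for slot, k in zip(slots, path)]


def select_units(lattice: Sequence[Sequence[Candidate]], max_error: float = 0.2) -> List[Candidate]:
    """Filter every slot by the error criterion, then pick the smoothest path."""
    return smoothest_path([merc_filter(slot, max_error) for slot in lattice])
```

In this sketch the prosodic criterion acts only as a gate: every candidate that survives stage 1 is treated as prosodically equivalent, and only the join cost decides among them. A candidate that would join perfectly but is likely to sound prosodically wrong is therefore never chosen, unless it is the least risky instance in a slot where no candidate passes the threshold.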
Key words
computer application / Chinese information processing / speech / uncertainty of prosody / unit selection / minimum error-rate criterion