Abstract
Data-driven visual speech synthesis and bimodal speech recognition both depend on a suitable bimodal corpus. This paper proposes a method for constructing a Chinese bimodal corpus. Based on the visual features of the lips during pronunciation in the video, the fuzzy c-means method is used to cluster an existing triphone model into a visual triphone model. On top of the visual triphones, an evaluation function scores the sentences in the original corpus, and the corpus is then selected automatically. Compared with other bimodal corpora, the resulting corpus shows substantial improvement in coverage rate, coverage efficiency, and the distribution of high-frequency words, and it reflects the bimodal phenomena of Mandarin Chinese more faithfully.
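The selection step summarized in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `VISEME_CLASS` table is a hypothetical stand-in for the classes the paper obtains by fuzzy c-means clustering, and `score` (newly covered visual triphones per phone) is an assumed evaluation function, since the abstract does not state the actual one. The greedy loop shows only the general idea of scoring sentences and selecting them automatically.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical phoneme-to-viseme-class map. The paper derives its classes by
# fuzzy c-means clustering of lip features; that step is not reproduced here.
VISEME_CLASS: Dict[str, str] = {
    "b": "V1", "p": "V1", "m": "V1",  # bilabials look alike on the lips
    "a": "V2",
    "i": "V3",
    "u": "V4", "o": "V4",
}

Unit = Tuple[str, ...]


def visual_triphones(phones: List[str]) -> Set[Unit]:
    """Map phones to viseme classes and collect every run of three classes."""
    classes = [VISEME_CLASS[p] for p in phones]
    return {tuple(classes[i:i + 3]) for i in range(len(classes) - 2)}


def score(units: Set[Unit], covered: Set[Unit], length: int) -> float:
    """Assumed evaluation function: newly covered units per phone."""
    return len(units - covered) / max(length, 1)


def select(corpus: Dict[str, List[str]], budget: int) -> List[str]:
    """Greedily pick up to `budget` sentence ids, highest score first."""
    units = {sid: visual_triphones(ph) for sid, ph in corpus.items()}
    covered: Set[Unit] = set()
    chosen: List[str] = []
    remaining = set(corpus)
    while remaining and len(chosen) < budget:
        # sorted() makes tie-breaking deterministic (first id alphabetically).
        best = max(
            sorted(remaining),
            key=lambda s: score(units[s], covered, len(corpus[s])),
        )
        if not units[best] - covered:
            break  # no remaining sentence adds a new visual triphone
        chosen.append(best)
        covered |= units[best]
        remaining.remove(best)
    return chosen
```

In this sketch, two sentences whose phones fall into the same viseme classes contribute identical units, so the second one is never selected. That is the intended effect of selecting on visual triphones instead of acoustic ones.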
Key words: computer application; Chinese information processing; visual speech synthesis; bimodal speech recognition; bimodal corpus; visual triphone; evaluation function
Funding
Weapons and Equipment Pre-Research Project of the 11th Five-Year Plan (51329060101)