ZHAO Hui, LIN Chenglong, TANG Chaojing. Construction of Chinese Bimodal Corpus Based on Visual Triphone[J]. Journal of Chinese Information Processing, 2009, 23(5): 98-104.
Construction of Chinese Bimodal Corpus Based on Visual Triphone
ZHAO Hui, LIN Chenglong, TANG Chaojing
College of Electronic Science and Engineering, National University of Defense Technology, Changsha, Hunan 410073, China
Abstract: This paper proposes a method for constructing a Chinese bimodal corpus, a resource vital to data-driven visual speech synthesis and bimodal speech recognition. Based on the visual features of the lips during pronunciation, as captured on video, fuzzy c-means clustering is used to cluster triphone models and establish a visual triphone model. An evaluation function built on the visual triphone model then scores the sentences in the original corpus, and the final corpus is selected automatically from those scores. Compared with other bimodal corpora, the corpus built with the proposed method shows substantially better coverage rate, coverage efficiency, and distribution of high-frequency words, and it reflects the bimodal phenomena of Mandarin Chinese more faithfully.
Key words: computer application; Chinese information processing; visual speech synthesis; bimodal speech recognition; bimodal corpus; visual triphone; evaluation function