To Extract and Apply Visual Features in Mandarin Multimodal Continuous Speech Recognition

LIU Peng, WANG Zuo-ying

Journal of Chinese Information Processing, 2004, Vol. 18, Issue (4): 80-85.

Abstract

This paper investigates the use of visual features in a multimodal Mandarin speech recognition system. We present an audio-visual fusion scheme based on the multi-stream hidden Markov model (Multi-stream HMM, MSHMM) and discuss in detail two key technologies related to visual features: lip location and visual feature extraction. We first study a lip tracking method based on template matching. We then investigate low-level visual features based on linear transforms and compare them with high-level visual features based on active shape models. Experiments show that, once visual information is introduced, the first-candidate error rate at the acoustic level in a noise-free environment drops by a relative 36.09% compared with an audio-only speech recognition system. Further experiments show that the audio-visual system is also markedly more robust in noisy environments.
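The abstract names the multi-stream HMM as the fusion scheme but gives no formulas. As a rough illustration only, the sketch below uses the common textbook form of multi-stream fusion, in which a state's emission score is a weighted sum of the audio and visual log-likelihoods. The diagonal-Gaussian stream models, the function names, and the `audio_weight` value are assumptions made for this example and are not taken from the paper.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def fused_log_likelihood(audio_obs, visual_obs, state, audio_weight=0.7):
    """Emission score of one HMM state as a weighted sum of stream scores.

    `state` is a hypothetical dict holding one (mean, var) pair per stream.
    The stream weights are assumed to sum to 1; 0.7 is an arbitrary default.
    """
    visual_weight = 1.0 - audio_weight
    log_audio = diag_gaussian_logpdf(audio_obs, *state["audio"])
    log_visual = diag_gaussian_logpdf(visual_obs, *state["visual"])
    return audio_weight * log_audio + visual_weight * log_visual


# Toy state with a 2-dimensional audio stream and a 1-dimensional visual stream.
toy_state = {
    "audio": ([0.0, 0.0], [1.0, 1.0]),
    "visual": ([0.0], [1.0]),
}
score = fused_log_likelihood([0.1, -0.2], [0.3], toy_state, audio_weight=0.7)
```

Lowering `audio_weight` shifts trust toward the visual stream, which is the usual motivation for this kind of fusion when the audio is noisy. How the paper actually sets or trains its stream weights is not stated in the abstract.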

Key words

computer application / Chinese information processing / multimodal / audio-visual fusion / visual feature extraction / robustness

Cite this article

LIU Peng, WANG Zuo-ying. To Extract and Apply Visual Features in Mandarin Multimodal Continuous Speech Recognition. Journal of Chinese Information Processing. 2004, 18(4): 80-85

Funding

Supported by the National 863 Program of China (2001AA114071)