Research on Large Vocabulary Continuous Speech Recognition System for Mandarin Chinese

NI Chong-jia, LIU Wen-ju, XU Bo

Journal of Chinese Information Processing, 2009, Vol. 23, Issue (1): 112.
Review


Abstract

The technology of large vocabulary continuous speech recognition (LVCSR) has developed rapidly in recent years and has been widely applied in many fields. Many large companies at home and abroad have intensified their research on speech recognition, and a number of commercial speech recognition systems have come to market and entered fairly widespread use. This paper reviews recent research progress in LVCSR and describes the framework and design of Mandarin Chinese LVCSR systems, focusing on systems based on statistical methods. The key technologies and principles of speech recognition systems are analyzed, and recent research trends at home and abroad are discussed.
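The statistical framework referenced in the abstract rests on the Bayes decision rule: the recognizer selects the word sequence that maximizes the product of the acoustic-model and language-model probabilities. The following is the conventional formulation of that rule, not a formula quoted from the paper's full text:

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \, P(W \mid X)
       \;=\; \operatorname*{arg\,max}_{W} \, \frac{P(X \mid W)\,P(W)}{P(X)}
       \;=\; \operatorname*{arg\,max}_{W} \, P(X \mid W)\,P(W)
```

Here $X$ is the sequence of acoustic feature vectors, $P(X \mid W)$ is the acoustic model (typically an HMM), and $P(W)$ is the language model (typically an $n$-gram); $P(X)$ is constant over candidate word sequences $W$ and can be dropped from the maximization.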

Key words

computer application / Chinese information processing / overview / speech recognition / model adaptation / search technology

Cite this article

NI Chong-jia, LIU Wen-ju, XU Bo. Research on Large Vocabulary Continuous Speech Recognition System for Mandarin Chinese. Journal of Chinese Information Processing, 2009, 23(1): 112.


Funding

National Basic Research Program of China (973 Program) (2004CB318105); National High Technology Research and Development Program of China (863 Program) (2006AA01Z194, 20060101Z4073); National Natural Science Foundation of China (60675026, 60121302, 90820011)