End-to-end speech recognition does not require forced alignment between text and speech sequences, and compared with traditional speech recognition systems it has a simpler, more intuitive structure and better adaptability. Because it does not depend on a precise pronunciation lexicon, it is a promising direction for speech recognition research on low-resource languages. Based on recurrent neural networks (RNN) and connectionist temporal classification (CTC), this paper implements end-to-end Uyghur speech recognition systems with modeling units of different granularity, and compares them with the traditional HMM speech recognition framework on a relatively small corpus (the open THUYG corpus). The end-to-end system built on mono-phones outperforms the traditional HMM-GMM framework, reducing CER by 10.6%; after slightly reducing redundancy, the end-to-end system that uses single characters as modeling units reduces CER by 2.23% compared with the triphone-based HMM-GMM system. For low-resource languages, optimizing the granularity of the modeling units is the next research target for improving performance.
Abstract
End-to-end speech recognition technology has a simpler and more intuitive framework, with better adaptability, than the traditional speech recognition framework. Based on RNN and CTC, this paper implements end-to-end speech recognition systems for Uyghur with different acoustic units. We compare this method with the traditional HMM speech recognition framework on a small corpus (THUYG). The experimental results show that the mono-phone end-to-end system outperforms the mono-phone HMM-GMM by 10.6% lower CER, and the character-based end-to-end system outperforms the triphone HMM-GMM by 2.23% lower CER.
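As a rough illustration of the two mechanisms the abstract names (this is a minimal sketch, not the paper's implementation), CTC's greedy decoding rule, merge repeated frame labels and then drop blanks, and the character error rate (CER) used to compare systems can each be written in a few lines of Python:

```python
def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse a per-frame CTC label sequence: merge repeats, drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != blank and lab != prev:
            out.append(lab)
        prev = lab
    return out


def cer(ref, hyp):
    """Character error rate: Levenshtein edit distance divided by len(ref)."""
    # Single-row dynamic-programming edit distance; d[j] holds the
    # distance between the ref prefix processed so far and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,                # deletion
                d[j - 1] + 1,            # insertion
                prev_diag + (r != h),    # substitution (0 cost if match)
            )
    return d[len(hyp)] / len(ref)


# Frame labels 0,1,1,0,2,2,2,0,1 (0 = blank) decode to the sequence 1,2,1.
print(ctc_greedy_collapse([0, 1, 1, 0, 2, 2, 2, 0, 1]))
# "kitten" -> "sitting" needs 3 edits over 6 reference characters: CER = 0.5.
print(cer("kitten", "sitting"))
```

In practice the frame labels would come from the argmax of the network's per-frame softmax outputs, and the reported CER would be averaged over a whole test set; both details are omitted here for brevity.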
Key words
end-to-end /
ASR /
Uyghur /
connectionist temporal classification
Funding
National Key Research and Development Program of China (2017YFC0820602)