End-to-end speech recognition does not require forced alignment between text and speech sequences, and compared with traditional speech recognition systems it has a simpler, more intuitive structure and better adaptability. Because it does not depend on a precise pronunciation lexicon, it is a promising direction for speech recognition research on low-resource languages. Based on recurrent neural networks (RNN) and connectionist temporal classification (CTC), this paper implements end-to-end Uyghur speech recognition systems with modeling units of different granularity, and compares them with the traditional HMM speech recognition framework on a relatively small corpus (the open THUYG corpus). The end-to-end system built on mono-phones outperforms the traditional HMM-GMM framework, reducing CER by 10.6%; after slightly reducing redundancy, the end-to-end system that uses single characters as modeling units reduces CER by 2.23% compared with the triphone-based HMM-GMM system. For low-resource languages, optimizing the granularity of the modeling units is the next research target for improving performance.
Abstract
End-to-end speech recognition technology has a simpler and more intuitive framework, with better adaptability, than the traditional speech recognition framework. Based on RNN and CTC, this paper implements end-to-end speech recognition systems for Uyghur with different acoustic units. We compare this method with the traditional HMM speech recognition framework on a small corpus (THUYG). The experimental results show that the mono-phone end-to-end system outperforms the mono-phone HMM-GMM by 10.6% lower CER, and the character-based end-to-end system outperforms the triphone HMM-GMM by 2.23% lower CER.
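As a rough illustration of the two mechanisms the abstract names (this is a minimal sketch, not the paper's implementation), CTC's greedy decoding rule, merge repeated frame labels and then drop blanks, and the character error rate (CER) used to compare systems can each be written in a few lines of Python:

```python
def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse a per-frame CTC label sequence: merge repeats, drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != blank and lab != prev:
            out.append(lab)
        prev = lab
    return out


def cer(ref, hyp):
    """Character error rate: Levenshtein edit distance divided by len(ref)."""
    # Single-row dynamic-programming edit distance; d[j] holds the
    # distance between the ref prefix processed so far and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,                # deletion
                d[j - 1] + 1,            # insertion
                prev_diag + (r != h),    # substitution (0 cost if match)
            )
    return d[len(hyp)] / len(ref)


# Frame labels 0,1,1,0,2,2,2,0,1 (0 = blank) decode to the sequence 1,2,1.
print(ctc_greedy_collapse([0, 1, 1, 0, 2, 2, 2, 0, 1]))
# "kitten" -> "sitting" needs 3 edits over 6 reference characters: CER = 0.5.
print(cer("kitten", "sitting"))
```

In practice the frame labels would come from the argmax of the network's per-frame softmax outputs, and the reported CER would be averaged over a whole test set; both details are omitted here for brevity.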
Key words
end-to-end /
ASR /
Uyghur /
connectionist temporal classification
Funding
National Key Research and Development Program of China (2017YFC0820602)