Chinese Speech Recognition Based on Pinyin Constraint and Joint Learning

LIANG Renfeng1,2, YU Zhengtao1,2, GAO Shengxiang1,2, HUANG Yuxin1,2, GUO Junjun1,2, XU Shuli1,2

Journal of Chinese Information Processing ›› 2022, Vol. 36 ›› Issue (10): 167-172.
Speech Information Processing

Abstract

Current speech recognition models already achieve good performance on phonographic languages such as English and French. Chinese, however, is a typical logographic language: Chinese characters have no direct correspondence with pronunciation, whereas Pinyin, the phonetic notation for Chinese characters, is inherently convertible to and from them. Using Pinyin as a constraint during decoding therefore introduces an inductive bias closer to the speech signal. Within a multi-task learning framework, this paper proposes a Chinese speech recognition method based on Pinyin-constrained joint learning: end-to-end character-level recognition serves as the primary task and Pinyin-level recognition as the auxiliary task. The two tasks share an encoder, so that both character and Pinyin labels act as supervisory signals and strengthen the encoder's representation of Chinese speech. Experiments show that the proposed method outperforms the baseline model, reducing the word error rate (WER) by 2.24%.
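The multi-task objective described above — a primary character-level loss combined with an auxiliary Pinyin-level loss computed over a shared encoder — can be sketched as a weighted sum. The interpolation weight `lam` and the function name are illustrative assumptions; the paper's exact weighting scheme is not given here.

```python
def joint_loss(char_loss: float, pinyin_loss: float, lam: float = 0.7) -> float:
    """Multi-task objective sketch: lam * L_char + (1 - lam) * L_pinyin.

    `char_loss` is the primary (character-target) ASR loss, `pinyin_loss`
    the auxiliary (Pinyin-target) loss; both decoders share one encoder,
    so gradients from both terms update the shared representation.
    `lam` is a hypothetical weight favoring the primary task.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * char_loss + (1.0 - lam) * pinyin_loss
```

In typical joint-learning setups the weight is tuned on a development set; a larger `lam` keeps the auxiliary task from dominating the primary character-recognition objective.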

Key words

end-to-end / Chinese speech recognition / joint learning / Pinyin
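The 2.24% improvement reported in the abstract is measured in word error rate (WER), the standard ASR metric: the token-level Levenshtein distance between hypothesis and reference, normalized by reference length. A minimal sketch:

```python
def wer(reference: list, hypothesis: list) -> float:
    """Word error rate: (substitutions + deletions + insertions) / len(reference),
    computed as token-level Levenshtein distance via dynamic programming."""
    r, h = reference, hypothesis
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # delete all remaining reference tokens
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insert all remaining hypothesis tokens
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[len(r)][len(h)] / len(r)
```

For character-level Chinese ASR the same computation over characters is often reported as character error rate (CER); the formula is identical, only the token unit changes.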

Cite this article

LIANG Renfeng, YU Zhengtao, GAO Shengxiang, HUANG Yuxin, GUO Junjun, XU Shuli. Chinese Speech Recognition Based on Pinyin Constraint and Joint Learning. Journal of Chinese Information Processing, 2022, 36(10): 167-172.
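The auxiliary Pinyin targets exploited by the method can be derived deterministically from the character transcript with a pronunciation lexicon, since Pinyin annotates the reading of each character. The four-entry lexicon below is purely illustrative; a real system would use a full pronunciation dictionary (e.g., the pypinyin library).

```python
# Hypothetical mini lexicon mapping characters to tone-numbered Pinyin.
LEXICON = {"中": "zhong1", "文": "wen2", "语": "yu3", "音": "yin1"}

def char_to_pinyin_targets(transcript: str) -> list:
    """Derive auxiliary Pinyin supervision from a character transcript,
    one tone-numbered syllable per character."""
    return [LEXICON[ch] for ch in transcript]
```

Because the mapping is automatic, the auxiliary task adds supervision without requiring any extra annotation beyond the original character transcripts.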


Funding

National Natural Science Foundation of China (61732005, U21B2027, 61972186); Yunnan High-Tech Industry Development Project (201606); Yunnan Provincial Major Science and Technology Special Program (202103AA080015, 202002AD080001-5); Yunnan Fundamental Research Project (202001AS070014); Yunnan Provincial Reserve Talent Program for Academic and Technical Leaders (202105AC160018)