Abstract
The training of a Mongolian acoustic model is the process by which the model learns the relationship between pronunciation data and annotation data. In phoneme-based Mongolian acoustic modeling, the one-to-many mapping between the pronunciation and the semantics of Mongolian words causes errors in the decoded Mongolian text, which in turn lowers the recognition rate of the Mongolian speech recognition system. To address this, taking an end-to-end model as the basis and using both Mongolian phonemes and letters as modeling units, this paper designs a Mongolian acoustic model based on BLSTM-CTC and gives a momentum training algorithm. Experimental results show that the letter-based BLSTM-CTC Mongolian acoustic model can effectively reduce the word error rate on heteromorphic homophones in a Mongolian speech recognition system.
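The abstract refers to two technical ingredients: CTC-based modeling and a momentum training algorithm. As a minimal illustrative sketch (not the paper's actual implementation; all function names here are hypothetical), the CTC greedy-decoding collapse rule and the classical momentum update can be written as:

```python
def ctc_collapse(path, blank=0):
    """Collapse a frame-level CTC label path: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def momentum_step(theta, grad, velocity, lr=0.1, gamma=0.9):
    """Classical momentum update: v <- gamma*v - lr*grad; theta <- theta + v."""
    velocity = [gamma * v - lr * g for v, g in zip(velocity, grad)]
    theta = [t + v for t, v in zip(theta, velocity)]
    return theta, velocity


# Example: the frame path [1, 1, 0, 1, 2, 2, 0] collapses to [1, 1, 2]
# (label 0 is the blank; the repeated 1 separated by a blank survives).
print(ctc_collapse([1, 1, 0, 1, 2, 2, 0]))  # → [1, 1, 2]
```

The blank-separated repetition is exactly what lets a CTC model emit the same letter twice in a row, which matters when letters rather than phonemes are the modeling units.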
Key words
heteromorphic homophone /
modeling unit /
End-to-End /
Mongolian acoustic model /
speech recognition
References
[1] Renals S, Morgan N, Bourlard H, et al. Connectionist probability estimators in HMM speech recognition[J]. IEEE Transactions on Speech and Audio Processing, 1994, 2(1): 161-174.
[2] Chan W, Jaitly N, Le Q, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai: IEEE, 2016: 4960-4964.
[3] 包希日莫, 高光来. State clustering of Mongolian acoustic models: question set design[J]. Journal of Inner Mongolia University (Natural Science Edition), 2013, 44(1): 87-92.
[4] 马志强, 李图雅, 杨双涛, et al. Research on Mongolian acoustic modeling based on deep neural networks[J]. CAAI Transactions on Intelligent Systems, 2018, 13(3): 486-492.
[5] Pundak G, Sainath T. Lower frame rate neural network acoustic models[C]//Proceedings of the Interspeech Annual Conference of the International Speech Communication Association. San Francisco: ISCA, 2016: 22-26.
[6] Lu L, Kong L, Dyer C, et al. Segmental recurrent neural networks for end-to-end speech recognition[C]//Proceedings of the Interspeech Annual Conference of the International Speech Communication Association. San Francisco: ISCA, 2016: 385-389.
[7] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh: ACM, 2006: 369-376.
[8] Chen Z, Zhuang Y, Qian Y, et al. Phone synchronous speech recognition with CTC lattices[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 25(1): 90-101.
[9] Maas A, Xie Z, Jurafsky D, et al. Lexicon-free conversational speech recognition with neural networks[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015: 345-354.
[10] National Technical Committee on Information Technology Standardization. GB/T 25914-2010, Information technology—Use rules of traditional Mongolian nominal characters, presentation characters and control characters[S]. Beijing: General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China, Standardization Administration of China, 2011.
[11] National Technical Committee on Information Technology Standardization (SAC/TC 28). GB/T 13000-2010, Information technology—Universal Multiple-Octet Coded Character Set (UCS)[S]. Beijing: General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China, Standardization Administration of China, 2011.
[12] Watanabe S, Hori T, Kim S, et al. Hybrid CTC/attention architecture for end-to-end speech recognition[J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1240-1253.
[13] Lu L, Kong L, Dyer C, et al. Multitask learning with CTC and segmental CRF for speech recognition[C]//Proceedings of the Interspeech Annual Conference of the International Speech Communication Association. Stockholm: ISCA, 2017: 954-958.
[14] Graves A, Jaitly N. Towards end-to-end speech recognition with recurrent neural networks[C]//Proceedings of the 31st International Conference on Machine Learning. Beijing: JMLR.org, 2014: 1764-1772.
[15] Chiu C C, Sainath T N, Wu Y, et al. State-of-the-art speech recognition with sequence-to-sequence models[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 4774-4778.
[16] Sak H, Senior A, Rao K, et al. Fast and accurate recurrent neural network acoustic models for speech recognition[C]//Proceedings of the Interspeech Annual Conference of the International Speech Communication Association. Dresden: ISCA, 2015: 1468-1472.
Funding
National Natural Science Foundation of China (61762070, 61862048); Natural Science Foundation of Inner Mongolia Autonomous Region (2019MS06004); Science and Technology Major Project of Inner Mongolia Autonomous Region (2019ZD015); Key Technology Research Program of Inner Mongolia Autonomous Region (2019GG273)