Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (4): 156-164
Speech Information Processing

End-to-End Speech Recognition in News Field Based on Conformer

ZHANG Jimin1,2, ZAOKERE Kadeer1,2, AISHAN Wumaier1,2, SHEN Yunfei2,3, WANG Liejun1,2

Abstract

Most open-source Chinese speech recognition datasets are built for the general domain, and no open-source corpus is available for the news domain. This paper therefore constructs CH_NEWS_ASR, a Chinese speech recognition dataset for the news domain, and verifies its validity with the RNN, Transformer and Conformer models of the ESPNET-0.9.6 framework; on this corpus the best model achieves a CER of 4.8% and an SER of 39.4%. Because news broadcasters speak relatively fast, the average transcript length of the dataset is 28 characters, twice that of the Aishell_1 dataset. In addition, the training objectives in previous work are usually defined at the character or word level and lack an explicit sentence-level relation, so this paper proposes a sentence-level agreement module that is combined with the Conformer model to directly reduce the representation difference between the source speech and the target text. On the open-source Aishell_1 dataset this module lowers the CER by 0.4% and the SER by 2%, and on the CH_NEWS_ASR dataset it lowers the CER by 0.9% and the SER by 3%. The results show that the proposed method effectively improves recognition quality without increasing the number of model parameters.
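The abstract only states that the sentence-level agreement module reduces the representation difference between the source speech and the target text without adding parameters; the paper's exact formulation is not given here. The sketch below is one plausible, minimal form such a term could take, assuming mean-pooled Conformer encoder outputs, mean-pooled target-text embeddings of the same width, and a cosine-distance penalty added to the usual hybrid CTC/attention loss. All function and variable names are hypothetical.

```python
# Illustrative sketch only, not the authors' implementation: a sentence-level
# agreement term between pooled speech and text representations.
import torch
import torch.nn.functional as F


def sentence_agreement_loss(enc_out: torch.Tensor, enc_mask: torch.Tensor,
                            text_emb: torch.Tensor, text_mask: torch.Tensor) -> torch.Tensor:
    """Cosine distance between mean-pooled speech and text sentence vectors.

    enc_out:   (B, T, D) Conformer encoder output for the source speech
    enc_mask:  (B, T)    1.0 for valid frames, 0.0 for padding
    text_emb:  (B, L, D) embedded target transcript, same width D as enc_out
    text_mask: (B, L)    1.0 for valid tokens, 0.0 for padding
    """
    # Masked mean pooling collapses each utterance to a single sentence vector.
    speech_vec = (enc_out * enc_mask.unsqueeze(-1)).sum(dim=1) / enc_mask.sum(dim=1, keepdim=True)
    text_vec = (text_emb * text_mask.unsqueeze(-1)).sum(dim=1) / text_mask.sum(dim=1, keepdim=True)
    # 1 - cosine similarity is 0 when the two sentence vectors point the same way.
    return (1.0 - F.cosine_similarity(speech_vec, text_vec, dim=-1)).mean()


# Hypothetical combination with a standard hybrid CTC/attention objective
# (weights are placeholders):
# loss = 0.3 * loss_ctc + 0.7 * loss_att + 0.1 * sentence_agreement_loss(
#     enc_out, enc_mask, text_emb, text_mask)
```

Because such a term only reuses representations that the encoder and the embedding layer already produce, it adds no trainable parameters, which is consistent with the claim in the abstract.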

Key words

end-to-end speech recognition / Conformer / sentence-level agreement

Cite this article

ZHANG Jimin, ZAOKERE Kadeer, AISHAN Wumaier, SHEN Yunfei, WANG Liejun. End-to-End Speech Recognition in News Field Based on Conformer. Journal of Chinese Information Processing, 2024, 38(4): 156-164.


Funding

Science and Technology Innovation Leading Talent Project of the Xinjiang Uygur Autonomous Region, High-level Leading Talents (2022TSYCLJ0036)