Abstract
The performance of a large-vocabulary continuous speech recognition system depends heavily on the quality of its speech database, and text selection is the central step in designing that database. Conventional text selection methods tend to consider a single factor, which keeps recognition systems from making effective use of linguistic information. This paper describes a text selection method that weighs several factors together: triphone covering rate, triphone covering efficiency, triphone sparse rate, and the distribution of commonly used words, among others. Selection is fully automatic and makes full use of the original corpus, so the selected text carries richer information. The automatically selected text covers 94.1% of the triphones and 75.4% of the most commonly used words, and its covering efficiency and sparse rate are considerably better than those of conventional methods.
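The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one common way to select text for triphone coverage: a greedy pass that repeatedly picks the sentence adding the most unseen triphones per phone. The sentence IDs, the phone labels, the function names, and the scoring rule are all assumptions made for illustration, not the authors' method, and the sketch ignores the sparse rate and the word-distribution criteria.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the set of three-phone windows in a phone sequence."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def greedy_select(corpus: Dict[str, List[str]], max_sentences: int) -> List[str]:
    """Greedily pick sentence IDs by new triphones gained per phone.

    The score (new triphones / sentence length) is an assumed stand-in
    for "covering efficiency"; the paper may define it differently.
    """
    units = {sid: triphones(phones) for sid, phones in corpus.items()}
    covered: Set[Triphone] = set()
    chosen: List[str] = []
    remaining = set(corpus)
    while remaining and len(chosen) < max_sentences:
        # Sorting makes tie-breaking deterministic (first ID wins).
        best = max(
            sorted(remaining),
            key=lambda sid: len(units[sid] - covered) / max(len(corpus[sid]), 1),
        )
        if not units[best] - covered:
            break  # no remaining sentence adds a new triphone
        chosen.append(best)
        covered |= units[best]
        remaining.remove(best)
    return chosen


def coverage_rate(chosen: List[str], corpus: Dict[str, List[str]]) -> float:
    """Fraction of the corpus's distinct triphones covered by the chosen sentences."""
    all_units = set().union(*(triphones(p) for p in corpus.values()))
    got = set().union(*(triphones(corpus[sid]) for sid in chosen))
    return len(got) / len(all_units) if all_units else 0.0


if __name__ == "__main__":
    # Toy corpus with made-up phone labels.
    toy = {
        "s1": ["a", "b", "c", "d"],
        "s2": ["a", "b", "c"],
        "s3": ["x", "y", "z"],
    }
    picked = greedy_select(toy, max_sentences=10)
    print(picked, coverage_rate(picked, toy))
```

Here the coverage rate is measured against the triphones that occur in the candidate corpus; measuring it against a full triphone inventory of the language, as a reported figure such as 94.1% would presumably require, means passing in that inventory instead.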
Key words
computer application /
Chinese information processing /
speech database /
triphone /
commonly used words /
covering rate
Funding
Supported by the National Natural Science Foundation of China (60172055); the National "863" Program (2001AA114181); the Beijing Natural Science Foundation (4002012)