Pinyin-to-Character Conversion Model Based on Support Vector Machines

JIANG Wei, GUAN Yi, WANG Xiao-long, LIU Bing-quan

Journal of Chinese Information Processing, 2007, Vol. 21, Issue (2): 100-105.
Review


Abstract

N-gram models make it difficult to incorporate additional features into Pinyin-to-Character conversion. To address this, the paper proposes a conversion model based on Support Vector Machines (SVM), which provides a framework for integrating multiple knowledge sources. The strong generalization ability of SVM reduces the overfitting to which traditional models are prone, and soft-margin classification mitigates, to some extent, the effect of noise in small samples. In addition, rough set theory is used to extract complex and long-distance features, which are fused into the SVM model as a new kind of feature, overcoming the difficulty traditional models have in capturing long-distance constraints. Experimental results show that the SVM-based model achieves 1.2% higher precision than a conventional trigram model with absolute smoothing, and that the SVM model with added long-distance features improves precision by 1.6%.
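To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear soft-margin SVM trained one-vs-rest by batch subgradient descent on the objective lam/2*|w|^2 + mean hinge loss. The toy lexicon, the feature templates, the four training contexts, and all function names are hypothetical. The long-distance feature here is a hand-picked far word standing in for what the paper extracts with rough set theory.

```python
from collections import defaultdict

# Hypothetical toy lexicon: candidate characters for each pinyin syllable.
CANDIDATES = {"ta": ["他", "她"]}

# Hypothetical contexts: (prev_char, pinyin, next_pinyin, far_word, gold).
# The local context of rows 1/3 and 2/4 is identical; only the far word differs.
CONTEXTS = [
    ("问", "ta", "men", "先生", "他"),
    ("说", "ta", "hen", "爸爸", "他"),
    ("问", "ta", "men", "女士", "她"),
    ("说", "ta", "hen", "妈妈", "她"),
]

def extract(prev_char, pinyin, next_pinyin, far_word=None):
    """Binary features for one syllable; the templates are illustrative."""
    feats = [f"py={pinyin}", f"prev={prev_char}", f"next={next_pinyin}"]
    if far_word is not None:
        feats.append(f"far={far_word}")  # stand-in for a long-distance feature
    return feats

def train_binary(samples, targets, lam=0.01, eta=0.1, steps=2000):
    """Soft-margin linear SVM by batch subgradient descent.

    Minimises lam/2*|w|^2 + mean hinge loss and returns the average of
    the iterates, which is more stable than the last iterate.
    """
    w, avg = defaultdict(float), defaultdict(float)
    n = len(samples)
    for _ in range(steps):
        for f, v in w.items():          # accumulate the running average
            avg[f] += v / steps
        grad = defaultdict(float)
        for feats, y in zip(samples, targets):
            if y * sum(w[f] for f in feats) < 1.0:  # margin violated
                for f in feats:
                    grad[f] -= y / n
        for f in set(w) | set(grad):    # regulariser plus hinge subgradient
            w[f] -= eta * (lam * w[f] + grad[f])
    return avg

def train(samples, labels):
    """One binary SVM per character (one-vs-rest)."""
    return {c: train_binary(samples, [1.0 if l == c else -1.0 for l in labels])
            for c in sorted(set(labels))}

def convert(models, pinyin, feats):
    """Pick the candidate character whose classifier scores highest."""
    return max(CANDIDATES[pinyin],
               key=lambda c: sum(models[c].get(f, 0.0) for f in feats))

def accuracy(use_far):
    """Training-set accuracy with or without the long-distance feature."""
    samples = [extract(p, py, n, far if use_far else None)
               for p, py, n, far, _ in CONTEXTS]
    gold = [g for *_, g in CONTEXTS]
    models = train(samples, gold)
    hits = sum(convert(models, "ta", s) == g for s, g in zip(samples, gold))
    return hits / len(gold)
```

Calling `accuracy(False)` and `accuracy(True)` compares the two feature sets on this toy data. Without the far-word feature, contexts that share the same neighbours are indistinguishable, so the classifier cannot get both members of a pair right. Adding the feature makes the data linearly separable. The regularisation weight `lam` plays the soft-margin role: a larger value tolerates more margin violations in exchange for a smaller weight vector.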

Key words

artificial intelligence / natural language processing / support vector machines / Pinyin-to-Character conversion / rough set theory / long-distance features

Cite this article

JIANG Wei, GUAN Yi, WANG Xiao-long, LIU Bing-quan. Pinyin-to-Character Conversion Model Based on Support Vector Machines. Journal of Chinese Information Processing. 2007, 21(2): 100-105


Funding

Key Program of the National Natural Science Foundation of China (60435020); National Natural Science Foundation of China (60504021)