To overcome the difficulty of fusing additional features into n-gram models, this paper proposes a Pinyin-to-Character conversion model based on Support Vector Machines (SVM), which can integrate more statistical information. The strong generalization performance of SVM effectively mitigates the overfitting problem of the traditional model, and the soft-margin strategy alleviates, to some extent, the effect of noise in the corpus. Furthermore, rough set theory is applied to extract complex, long-distance features, which are fused into the SVM model as a new kind of feature, addressing the difficulty traditional models have in capturing long-distance dependencies. Experimental results show that the SVM Pinyin-to-Character conversion model achieved 1.2% higher precision than a trigram model with absolute smoothing, and the SVM model with long-distance features achieved 1.6% higher accuracy.
JIANG Wei, GUAN Yi, WANG Xiao-long, LIU Bing-quan.
Pinyin-to-Character Conversion Model Based on Support Vector Machines. Journal of Chinese Information Processing. 2007, 21(2): 100-105