Research on Several Key Issues in Tibetan Word Segmentation

Wanmezhaxi, Nimazhaxi

Journal of Chinese Information Processing, 2014, Vol. 28, Issue 4: 132-139.
Minority Language Information Processing

Abstract

Building on an analysis of existing Tibetan word segmentation methods, this paper examines Tibetan word-formation rules, syntactic structure, the part-of-speech relations between adjacent words, the attachment patterns of suffix letters, and the usage of case particles. It focuses on identifying and handling out-of-vocabulary words, contracted words (abbreviations), and overlapping ambiguities, and proposes three methods: the re-combination method, the exclusion-restoration method, and the POS rule method. In tests on 1M of Tibetan text covering literature, poetry, medicine, and news, the recognition accuracy for out-of-vocabulary words, contracted words, and overlapping ambiguities reached 99.84%, 99.95%, and 92.02%, respectively.
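To make the term "overlapping ambiguity" concrete, the following is a minimal sketch of how such spans can be detected with a lexicon: a syllable sequence A B C is ambiguous when both AB and BC are lexicon entries. This is a generic dictionary-based illustration, not the POS rule method proposed in the paper. The function names, the `max_len` parameter, and the placeholder syllables are all assumptions made for the example.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open syllable range [start, end)


def lexicon_matches(syllables: List[str],
                    lexicon: Set[Tuple[str, ...]],
                    max_len: int) -> List[Span]:
    """Collect every multi-syllable span that is a lexicon entry (hypothetical helper)."""
    matches = []
    n = len(syllables)
    for i in range(n):
        # Only spans of at least two syllables can overlap with a neighbour.
        for j in range(i + 2, min(n, i + max_len) + 1):
            if tuple(syllables[i:j]) in lexicon:
                matches.append((i, j))
    return matches


def overlapping_ambiguities(syllables: List[str],
                            lexicon: Set[Tuple[str, ...]],
                            max_len: int = 4) -> List[Tuple[Span, Span]]:
    """Return pairs of lexicon spans that share syllables without nesting."""
    spans = lexicon_matches(syllables, lexicon, max_len)
    return [(a, b) for a in spans for b in spans
            if a[0] < b[0] < a[1] < b[1]]


# Placeholder syllables stand in for Tibetan syllables; the lexicon is a toy.
toy_lexicon = {("a", "b"), ("b", "c")}
print(overlapping_ambiguities(["a", "b", "c", "d"], toy_lexicon))
```

Detection only locates the candidate spans. Choosing between the competing segmentations is the step where a disambiguation strategy, such as rules over adjacent parts of speech, would apply.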

Key words

out-of-vocabulary word / abbreviation / overlapping ambiguity

Cite this article

Wanmezhaxi, Nimazhaxi. Research on Several Key Issues in Tibetan Word Segmentation. Journal of Chinese Information Processing. 2014, 28(4): 132-139

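Funding

Supported by the Ministry of Science and Technology 973 Program preliminary research special project "Research on Tibetan Speech Recognition Technology" (2009CB326201); by the Tibetan Information Technology Innovation Team (IRT0975) under the Program for Changjiang Scholars and Innovative Research Team in University; and by the Phase III project of the Tibet University "211 Project".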