Abstract
Building on an analysis of existing Tibetan automatic word segmentation methods, this paper examines Tibetan word formation rules, syntactic structure, the part-of-speech relations between adjacent words, the rules for attaching suffix letters, and the usage of case particles. It focuses on identifying and handling out-of-vocabulary words, abbreviations, and overlapping ambiguities, and proposes three methods for them, respectively: the re-combination method, the exclusion-restoration method, and the POS rule method. In tests on 1M Tibetan corpora covering literature, poetry, medicine, and news, the recognition precision for out-of-vocabulary words, abbreviations, and overlapping ambiguities reaches 99.84%, 99.95%, and 92.02%, respectively.
Key words
out-of-vocabulary word /
abbreviation /
overlapping ambiguity
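Funding
Supported by the project "Research on Tibetan Speech Recognition Technology" (2009CB326201) under the Ministry of Science and Technology's 973 Program preliminary research special projects; by the Tibetan Information Technology Innovation Team (IRT0975) of the "Program for Changjiang Scholars and Innovative Research Team in University"; and by the Phase III project of the "Tibet University 211 Project".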