Abstract
Building on an analysis of existing Tibetan part-of-speech (POS) tagging methods, this paper proposes a discriminative approach to Tibetan POS tagging based on a perceptron-trained model. It focuses on feature templates that fit the lexical characteristics of Tibetan, on model training, and on the tagging procedure. On a manually annotated test set the method achieves a POS tagging precision of 98.26%, which indicates that it is practical for Tibetan natural language processing.
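To make the approach concrete, the following is a minimal sketch of perceptron-based POS tagging, not the authors' implementation. The feature template in `features` is hypothetical (the abstract does not list the templates actually used), decoding is greedy left to right rather than the Viterbi search of the structured perceptron, weights are not averaged, and the toy sentences are English placeholders rather than Tibetan data.

```python
def features(words, i, prev_tag):
    """Hypothetical feature template: word, neighbours, last character, previous tag."""
    w = words[i]
    prev_w = words[i - 1] if i > 0 else "<s>"
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"
    return [
        "w=" + w,
        "w-1=" + prev_w,
        "w+1=" + next_w,
        "suf=" + w[-1:],
        "t-1=" + prev_tag,
        "t-1,w=" + prev_tag + "," + w,
    ]


class PerceptronTagger:
    def __init__(self, tags):
        self.tags = sorted(tags)
        self.weights = {}  # (feature, tag) -> weight

    def score(self, feats, tag):
        return sum(self.weights.get((f, tag), 0.0) for f in feats)

    def predict(self, feats):
        # Ties are broken deterministically by tag name.
        return max(self.tags, key=lambda t: (self.score(feats, t), t))

    def tag(self, words):
        """Greedy left-to-right decoding using the predicted tag history."""
        prev, out = "<s>", []
        for i in range(len(words)):
            prev = self.predict(features(words, i, prev))
            out.append(prev)
        return out

    def train(self, sentences, max_epochs=100):
        """Perceptron updates; stops early after an epoch with no mistakes."""
        for _ in range(max_epochs):
            mistakes = 0
            for words, gold in sentences:
                prev = "<s>"
                for i, g in enumerate(gold):
                    feats = features(words, i, prev)
                    guess = self.predict(feats)
                    if guess != g:
                        mistakes += 1
                        for f in feats:
                            # Reward the gold tag, penalise the wrong guess.
                            self.weights[(f, g)] = self.weights.get((f, g), 0.0) + 1.0
                            self.weights[(f, guess)] = self.weights.get((f, guess), 0.0) - 1.0
                    prev = g  # condition on the gold history during training
            if mistakes == 0:
                break


# Toy placeholder data (not Tibetan): each word has exactly one tag.
train_data = [
    (["dog", "runs"], ["N", "V"]),
    (["cat", "sleeps"], ["N", "V"]),
    (["the", "dog", "sleeps"], ["D", "N", "V"]),
]
tagger = PerceptronTagger({"D", "N", "V"})
tagger.train(train_data)
print(tagger.tag(["the", "dog", "sleeps"]))
```

In a real system the feature template would be designed around Tibetan word structure and the model would be evaluated on a held-out annotated corpus, as the abstract describes.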
Key words
POS tagging /
perceptron model /
feature selection /
Tibetan part-of-speech tagging
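Funding
Preliminary Research Special Project of the 973 Program (2010CB334708); National Natural Science Foundation of China (61063033, 61163018, 61363055); Ministry of Education "Chunhui Plan" Cooperative Research Project (Z2012102)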