Abstract
This paper presents a discriminative-model-based approach to Tibetan word segmentation and investigates how the minimal word-formation unit and the reranking of segmentation results affect segmentation accuracy. For the word-formation unit, we compare three choices: the basic Tibetan character, the basic Tibetan character with tsheg (the syllable delimiter), and the syllable. Experiments show that using the syllable as the minimal unit yields the highest F-measure, 91.21%. For reranking, we propose a shortest-path strategy over a word lattice: the segmentations produced by discriminative decoding are compressed into a weighted directed graph in which nodes represent syllable boundaries and each edge, carrying its own weight, represents the candidate word formed by the syllables it spans. Selecting a shortest path through this graph segments the whole sentence and raises the final F-measure to 96.25%.
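The word-lattice, shortest-path reranking named in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting scheme, so the edge cost used here (negative log of the fraction of candidate segmentations that contain the edge) is an assumption, as are the function names `build_lattice` and `shortest_path` and the placeholder syllables `s1` to `s4`.

```python
import math
from collections import defaultdict

def build_lattice(candidates):
    """Merge candidate segmentations into a weighted DAG.

    Each candidate is a list of words; each word is a list of syllables.
    Nodes are syllable boundaries 0..n; edge (i, j) covers syllables i..j-1.
    """
    votes = defaultdict(int)
    lengths = set()
    for seg in candidates:
        pos = 0
        for word in seg:
            votes[(pos, pos + len(word))] += 1
            pos += len(word)
        lengths.add(pos)
    if len(lengths) != 1:
        raise ValueError("all candidates must cover the same syllables")
    total = len(candidates)
    # Assumed weighting (not from the paper): frequent edges are cheaper.
    edges = {span: -math.log(count / total) for span, count in votes.items()}
    return lengths.pop(), edges

def shortest_path(n, edges):
    """Dynamic programming over boundaries 0..n, which are already in topological order."""
    best = [math.inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    incoming = defaultdict(list)
    for (i, j), cost in edges.items():
        incoming[j].append((i, cost))
    for j in range(1, n + 1):
        for i, cost in incoming[j]:
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = i
    # Follow back-pointers from the last boundary to recover the chosen spans.
    spans = []
    j = n
    while j > 0:
        spans.append((back[j], j))
        j = back[j]
    return spans[::-1]

syllables = ["s1", "s2", "s3", "s4"]  # placeholder syllables, not real Tibetan
candidates = [
    [["s1", "s2"], ["s3"], ["s4"]],
    [["s1", "s2"], ["s3", "s4"]],
    [["s1"], ["s2"], ["s3", "s4"]],
]
n, edges = build_lattice(candidates)
spans = shortest_path(n, edges)
words = [syllables[i:j] for i, j in spans]
print(words)  # [['s1', 's2'], ['s3', 's4']]
```

Because boundary indices increase along every edge, a single left-to-right pass finds the shortest path without a general graph-search algorithm. A different weighting, for example model scores from the discriminative decoder, could be substituted in `build_lattice` without changing the path search.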
Key words
discriminative model /
Tibetan word segmentation /
word-formation unit /
reranking
Funding
863 Program major project (2011AA01A207)