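Tibetan Word Segmentation Based on Discriminative Classification and Reranking

SUN Meng, HUA que-cai-rang, CAI Zhijie, JIANG Wenbin, LV Yajuan, LIU Qun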

Journal of Chinese Information Processing, 2014, Vol. 28, Issue 2: 61-65.
Section: Minority Language Information Processing


Abstract

This paper presents a discriminative approach to Tibetan word segmentation and investigates how the minimal word-formation unit and the reranking of segmentation results affect segmentation quality. For the word-formation unit, we compare three granularities: the basic Tibetan character, the basic Tibetan character together with the tsheg (the syllable delimiter), and the syllable. The experiments show that using the syllable as the minimal unit gives the highest F-measure, 91.21%. For reranking, we propose a shortest-path strategy over a word lattice: the segmentations produced by discriminative decoding are compressed into a weighted directed graph in which nodes represent syllable boundaries and each edge, carrying its own weight, represents the candidate word formed by the syllables it spans. Selecting a shortest path through this graph yields the segmentation of the whole sentence and raises the final F-measure to 96.25%.
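The lattice-based reranking described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the edge weights are computed, so this example makes an assumption: each edge is weighted by the negative log of the fraction of n-best segmentations that contain that candidate word. The function names, the placeholder syllables `s1` to `s4`, and the n-best list are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def build_lattice(nbest):
    """Compress n-best segmentations into a lattice.

    Nodes are syllable boundaries 0..n. Each edge (start, end) is a
    candidate word covering syllables[start:end]; its value counts how
    many of the n-best segmentations contain that word.
    """
    counts = Counter()
    for segmentation in nbest:
        start = 0
        for word in segmentation:
            end = start + len(word)
            counts[(start, end)] += 1
            start = end
    return counts


def shortest_path_segmentation(syllables, nbest):
    """Return the segmentation along a shortest path through the lattice.

    ASSUMPTION: edge weight = -log(count / len(nbest)). The paper's
    actual weighting scheme is not given in the abstract.
    """
    if not nbest:
        raise ValueError("need at least one candidate segmentation")
    n = len(syllables)
    counts = build_lattice(nbest)
    total = len(nbest)

    best = [math.inf] * (n + 1)   # best[i]: cheapest cost to reach boundary i
    back = [0] * (n + 1)          # back[i]: start of the last word on that path
    best[0] = 0.0

    # Edges always point left to right, so visiting them in order of
    # their end boundary guarantees best[start] is already final.
    for (start, end), count in sorted(counts.items(), key=lambda kv: kv[0][1]):
        cost = best[start] - math.log(count / total)
        if cost < best[end]:
            best[end] = cost
            back[end] = start

    # Walk the back-pointers from the sentence end to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = back[pos]
        words.append(syllables[start:pos])
        pos = start
    words.reverse()
    return words


if __name__ == "__main__":
    syllables = ["s1", "s2", "s3", "s4"]          # placeholder syllables
    nbest = [                                      # hypothetical 3-best output
        [("s1", "s2"), ("s3",), ("s4",)],
        [("s1", "s2"), ("s3", "s4")],
        [("s1",), ("s2",), ("s3", "s4")],
    ]
    print(shortest_path_segmentation(syllables, nbest))
    # prints [['s1', 's2'], ['s3', 's4']]
```

With this weighting, a candidate word that appears in many of the n-best segmentations gets a cheap edge, so the shortest path tends to combine the words the decoder agrees on most, even when they come from different candidates.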

Key words

discriminative model / Tibetan word segmentation / word-formation unit / reranking

Cite this article

SUN Meng, HUA que-cai-rang, CAI Zhijie, JIANG Wenbin, LV Yajuan, LIU Qun. Tibetan Word Segmentation Based on Discriminative Classification and Reranking. Journal of Chinese Information Processing. 2014, 28(2): 61-65


Funding

Major Project of the 863 Program (2011AA01A207)