Research and Implementation of Tibetan Automatic Word Segmentation Based on Conditional Random Field

LI Yachao1, JAM Yangkyi1, ZONG Chengqing2, YU Hongzhi1

Journal of Chinese Information Processing, 2013, Vol. 27, Issue 4: 52-59.
Review

Abstract

Tibetan automatic word segmentation (TAWS) is a fundamental problem in Tibetan information processing, and abbreviated word recognition is one of its key and most difficult subproblems. All published methods for Tibetan abbreviated word recognition are rule-based and require lexicon support. In this paper, we propose a method for abbreviated word recognition based on conditional random fields (CRF), and on this basis implement a CRF-based TAWS system. Experimental results show that the proposed abbreviated word recognition method is fast and effective, can be combined easily with the CRF-based segmentation module, and significantly improves Tibetan word segmentation performance.

Key words

Tibetan automatic word segmentation / conditional random fields / abbreviated word recognition / case-auxiliary words

Cite this article

LI Yachao, JAM Yangkyi, ZONG Chengqing, YU Hongzhi. Research and Implementation of Tibetan Automatic Word Segmentation Based on Conditional Random Field. Journal of Chinese Information Processing. 2013, 27(4): 52-59

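Funding

Supported by the National Natural Science Foundation of China (61032008); the Open Project of the National Laboratory of Pattern Recognition (201001051); the Fundamental Research Funds for the Central Universities, Northwest University for Nationalities (ycx11135, zyz2011101)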