Study on Fusion of Unsupervised Features for Tibetan Word Segmentation

LI Yachao, JIA Yangji, JIANG Jing, HE Xiangzhen, YU Hongzhi

Journal of Chinese Information Processing, 2017, Vol. 31, Issue (2): 71-75.
Minority Language Information Processing


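Abstract

Tibetan word segmentation is a fundamental and critical problem in Tibetan information processing. Most current sequence-labeling approaches to Tibetan word segmentation rely on syllable position features, category features, and the like. This paper extracts unsupervised features from unlabeled corpora, namely boundary entropy features, accessor variety features, and unsupervised gap tagging features, and fuses them into a segmentation system based on sequence labeling. The experimental results show that, compared with the baseline Tibetan word segmentation system, the segmentation F-score improves by 0.97%, and out-of-vocabulary word recognition also improves considerably. This indicates that the unsupervised features extracted from unlabeled data are effective, and that fusing them with a supervised segmentation model significantly improves on the baseline system.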
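To make two of the unsupervised features named in the abstract more concrete, the following is a minimal sketch of how accessor variety and boundary entropy could be computed from unlabeled, syllable-split text. It is not the paper's implementation: the function names (`collect_neighbors`, `accessor_variety`, `boundary_entropy`), the `<s>`/`</s>` boundary markers, the bigram limit, and the toy Latin-letter "syllables" are all assumptions made for illustration, and sentence boundaries are treated as a single neighbor symbol for simplicity.

```python
import math
from collections import Counter, defaultdict


def collect_neighbors(sentences, max_n=2):
    """Count left and right neighbors of every syllable n-gram (n <= max_n).

    `sentences` is a list of syllable lists taken from unlabeled text.
    The boundary markers "<s>" and "</s>" are an assumption of this sketch.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for syllables in sentences:
        padded = ["<s>"] + list(syllables) + ["</s>"]
        for n in range(1, max_n + 1):
            # i ranges over start positions of n-grams inside the sentence
            for i in range(1, len(padded) - n):
                gram = tuple(padded[i:i + n])
                left[gram][padded[i - 1]] += 1
                right[gram][padded[i + n]] += 1
    return left, right


def accessor_variety(gram, left, right):
    """Smaller of the numbers of distinct left and right neighbors."""
    return min(len(left[gram]), len(right[gram]))


def boundary_entropy(neighbor_counts):
    """Shannon entropy (in bits) of a neighbor distribution."""
    total = sum(neighbor_counts.values())
    return sum(-(c / total) * math.log2(c / total)
               for c in neighbor_counts.values())


# Toy corpus: Latin letters stand in for Tibetan syllables (illustration only).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["x", "b", "c"]]
left, right = collect_neighbors(corpus)

av_b = accessor_variety(("b",), left, right)
av_ab = accessor_variety(("a", "b"), left, right)
h_right_ab = boundary_entropy(right[("a", "b")])
print(av_b, av_ab, h_right_ab)
```

A high accessor variety or boundary entropy suggests that a syllable sequence occurs in many different contexts and is therefore more likely to end at a word boundary. In a sequence-labeling system such scores would typically be discretized and attached to each syllable as extra features; how the paper does this is not stated in the abstract.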

Key words

Tibetan / word segmentation / sequence labeling

Cite this article

LI Yachao, JIA Yangji, JIANG Jing, HE Xiangzhen, YU Hongzhi. Study on Fusion of Unsupervised Features for Tibetan Word Segmentation. Journal of Chinese Information Processing. 2017, 31(2): 71-75


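Funding

National Social Science Foundation of China, Youth Project (15CYY043); National Natural Science Foundation of China (61262054); Gansu Provincial Higher Education Research Project (2016B-007); Open Fund of the Gansu Provincial Key Laboratory of Ethnic Language Intelligent Processing; Fundamental Research Funds for the Central Universities, Northwest Minzu University (31920140064, 31920150089)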