Tibetan Automatic Word Segmentation Based on Conditional Random Fields and Knowledge Fusion
Luobsang Karten1, YANG Yuanyuan2, ZHAO Xiaobing3
(1. School of Information Engineering, Minzu University of China, Beijing 100081, China;
2. School of Chinese Minority Language and Literature, Minzu University of China, Beijing 100081, China;
3. National Language Resource Monitoring & Research Center of Minority Languages,
Minzu University of China, Beijing 100081, China)
Abstract: Tibetan word segmentation is an essential task in Tibetan language processing. In this paper, a CRF model is trained on a manually annotated Tibetan corpus of 35.1M. The CRF segmentation results are then post-processed with rules that correct errors such as mis-segmentation of non-Tibetan characters, misrecognition of Tibetan adhesion words, and mis-segmentation of stop words and unregistered words. An open test demonstrates a precision of 96.11%, a recall of 96.03%, and an F-score of 96.06%.
Key words: Tibetan; word segmentation; CRFs; knowledge fusion
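
Word-level scores of the kind reported in the abstract are commonly computed by comparing the character spans of predicted words against those of a gold-standard segmentation. The following is a minimal sketch of that calculation, not the authors' evaluation script: the function names `word_spans` and `segmentation_prf` are hypothetical, and the toy input uses Latin placeholder letters rather than Tibetan syllables.

```python
def word_spans(words):
    """Map a segmented sentence to a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold, predicted):
    """Return (precision, recall, F-score) for one segmented sentence.

    Assumption: a predicted word counts as correct only if both of its
    boundaries match a word in the gold segmentation.
    """
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical toy data: only the first word is segmented correctly.
    gold = ["ab", "c", "de"]
    predicted = ["ab", "cd", "e"]
    print(segmentation_prf(gold, predicted))
```

In practice the counts would be accumulated over the whole test set before dividing, so that long and short sentences are weighted by the number of words they contain.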