PAN Huashan, YAN Xin, ZHOU Feng, YU Zhengtao, GUO Jianyi. A Khmer Word Segmentation and Part-of-Speech Tagging Method Based on Cascaded Conditional Random Fields[J]. Journal of Chinese Information Processing, 2016, 30(4): 110-116.
A Khmer Word Segmentation and Part-of-Speech Tagging Method Based on Cascaded Conditional Random Fields
PAN Huashan, YAN Xin, ZHOU Feng, YU Zhengtao, GUO Jianyi
School of Information Engineering and Automation, Kunming University of Science and Technology, and Key Lab of Computer Technologies Application of Yunnan Province, Kunming, Yunnan 650500, China
Abstract: This paper presents an automatic Khmer word segmentation and POS tagging method based on a Cascaded Conditional Random Fields (CCRFs) model. The approach stacks three Conditional Random Fields (CRFs) models. The first layer is a word segmentation model at Khmer character cluster (KCC) granularity, whose feature template incorporates the word-formation characteristics of Khmer. The second layer is a segmentation correction model at word granularity, whose feature template incorporates the characteristics of Khmer named entities. The third layer is the POS tagging model, whose feature template incorporates Khmer's rich affix information. Experiments on an open corpus yield a final accuracy of 95.44%, indicating that the proposed method handles Khmer word segmentation and POS tagging effectively.
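The data flow of the three-layer cascade can be illustrated with a minimal sketch. It is not the authors' implementation: the B/I tagging scheme, the assumption that the correction layer only merges adjacent words, and the names `merge_by_tags` and `cascade` are all hypothetical, and the three layers are passed in as plain callables standing in for trained CRF models. The toy taggers and Latin placeholder strings at the end exist only to show the flow.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical layer signature: a sequence of units in, one tag per unit out.
# In a real system each callable would wrap a trained CRF model.
Tagger = Callable[[Sequence[str]], List[str]]


def merge_by_tags(units: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Concatenate units into larger units: "B" starts a new one, any other tag continues it."""
    if len(units) != len(tags):
        raise ValueError("units and tags must have the same length")
    merged: List[str] = []
    for unit, tag in zip(units, tags):
        if tag == "B" or not merged:
            merged.append(unit)
        else:
            merged[-1] += unit
    return merged


def cascade(kccs: Sequence[str], segment: Tagger, correct: Tagger,
            tag_pos: Tagger) -> List[Tuple[str, str]]:
    """Run the three layers in order, feeding each layer's output to the next."""
    words = merge_by_tags(kccs, segment(kccs))      # layer 1: KCC granularity
    words = merge_by_tags(words, correct(words))    # layer 2: word granularity (assumed merge-only)
    return list(zip(words, tag_pos(words)))         # layer 3: POS tagging


# Toy stand-ins for the trained models (fixed outputs, for illustration only).
def toy_segment(kccs: Sequence[str]) -> List[str]:
    return ["B", "I", "B", "B"]


def toy_correct(words: Sequence[str]) -> List[str]:
    return ["B", "B", "I"]


def toy_pos(words: Sequence[str]) -> List[str]:
    return ["N", "V"]


result = cascade(["ka", "b", "c", "d"], toy_segment, toy_correct, toy_pos)
print(result)
```

Passing the layers as callables keeps the sketch independent of any particular CRF toolkit; each layer only needs to return one tag per input unit.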