Abstract
Stems (etyma) and morphological affixes are the components of Mongolian words, and the affixes carry a large amount of grammatical information such as number, case, aspect, and tense. Exploiting this information helps computers process Mongolian effectively. Because a Mongolian word is written as a single unit, the stem and each affix must be identified before this information can be used. By analyzing the morphological structure of Mongolian words, this paper proposes an effective labeling method for Mongolian words and builds a practical Mongolian word segmentation system based on the conditional random field model. Experiments show that the system reaches a segmentation accuracy of 0.992, a considerable improvement over existing Mongolian word segmentation systems.
Key words: Mongolian; word segmentation; etyma; morphological affix; conditional random fields; statistical language model
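The abstract does not spell out the labeling scheme, so the following is only a minimal sketch of how stem/affix segmentation can be cast as per-character sequence labeling, which is the kind of input a conditional random field tagger is trained on. The tag names (`B-S`, `I-S`, `B-A`, `I-A`), the function names `encode` and `decode`, and the toy Latin-letter string are all assumptions made for illustration; they are not the paper's actual scheme or data.

```python
def encode(stem, affixes):
    """Turn a segmented word into parallel lists of characters and labels.

    Hypothetical tag set (an assumption, not taken from the paper):
      B-S / I-S : first / later character of the stem
      B-A / I-A : first / later character of an affix
    """
    chars, labels = [], []
    for index, segment in enumerate([stem, *affixes]):
        kind = "S" if index == 0 else "A"
        for position, ch in enumerate(segment):
            chars.append(ch)
            labels.append(("B-" if position == 0 else "I-") + kind)
    return chars, labels


def decode(chars, labels):
    """Rebuild the segments from characters and (predicted) labels."""
    segments = []
    for ch, label in zip(chars, labels):
        # A "B-" label opens a new segment; anything else extends the last one.
        if label.startswith("B-") or not segments:
            segments.append(ch)
        else:
            segments[-1] += ch
    return segments


# Toy placeholder word: stem "abc" followed by affixes "de" and "f".
chars, labels = encode("abc", ["de", "f"])
print(list(zip(chars, labels)))
print(decode(chars, labels))
```

In a setup like this, a sequence labeler would be trained to predict `labels` from `chars` (plus context features), and `decode` would recover the stem and affixes from its predictions.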
Funding
Supported by a preliminary research project of the 973 Program (2007CB316503)