Abstract
This article presents a statistical language model, the Symbol-and-Statistics Decoding (SSD) model, which combines symbolic decoding with numerical decoding. The 2-gram SSD model is applied to Chinese part-of-speech (POS) tagging, where it reaches a tagging accuracy of 97.08% in the closed test and 95.67% in the open test, both considerably higher than the 95.56% and 94.70% obtained by a second-order HMM. Although the SSD model is less accurate than conditional models such as the maximum entropy model and the CRF model, it takes far less time to train, which makes it a practical model for certain tasks in natural language processing.
Key words: computer application; Chinese information processing; SSD model; HMM; POS tagging
Funding
Supported by the National Natural Science Foundation of China (60572159, 60872121)