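ZHANG Xiao-fei, CHEN Zhao-xiong, HUANG He-yan, CAI Zhi. Research on Algorithms for Processing New Words in Part-of-Speech Tagging [J]. Journal of Chinese Information Processing, 2003, 17(5): 2-6.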
ZHANG Xiao-fei, CHEN Zhao-xiong, HUANG He-yan, CAI Zhi. An Approach of Processing New Words Based on HMM in Tagging of Part of Speech. Journal of Chinese Information Processing, 2003, 17(5): 2-6.
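Research on Algorithms for Processing New Words in Part-of-Speech Tagging
ZHANG Xiao-fei1,2, CHEN Zhao-xiong2, HUANG He-yan2, CAI Zhi1
1. Department of Computer Science, University of Science and Technology of China 2. Research Center of Computer and Language Information Engineering, Chinese Academy of Sciences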
An Approach of Processing New Words Based on HMM in Tagging of Part of Speech
Abstract: Part-of-speech (POS) ambiguity is an important form of ambiguity in natural language processing that urgently needs to be resolved, and it is especially difficult to resolve for new words. This paper converts the POS tagging problem into the problem of calculating a word's emission probability and proposes a new approach based on the hidden Markov model (HMM) to solve it. The approach uses nothing more than a tagged corpus (no grammar dictionaries and no grammar rules). The results show that the accuracy reaches 97% in the closed test and 92% in the open test.
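The idea of reducing new-word tagging to an emission-probability estimate can be illustrated with a minimal sketch. The abstract does not say how the emission probability of a new word is calculated, so the estimator below is an assumption and not the authors' method: each tag's share of words seen exactly once in training stands in for its probability of emitting an unseen word. The function names (`train`, `log_emission`, `viterbi`), the toy corpus, and the add-one smoothing of transitions are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Collect transition and emission counts from a tagged corpus only."""
    trans = defaultdict(Counter)   # trans[prev_tag][tag]
    emit = defaultdict(Counter)    # emit[tag][word]
    word_freq = Counter()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            word_freq[word] += 1
            prev = tag
    # Assumption: words seen once under a tag approximate how readily
    # that tag emits a word never seen in training.
    hapax = {t: sum(1 for w in ws if word_freq[w] == 1) for t, ws in emit.items()}
    return trans, emit, hapax, set(word_freq)

def log_emission(model, word, tag):
    """log P(word | tag), with a hapax-based fallback for new words."""
    _, emit, hapax, vocab = model
    total = sum(emit[tag].values())
    if word in vocab:
        count = emit[tag][word]
        return math.log(count / total) if count else float("-inf")
    return math.log((hapax[tag] + 1) / (total + 1))

def viterbi(model, words):
    """Return the most probable tag sequence for a non-empty word list."""
    trans, emit, _, _ = model
    tags = sorted(emit)

    def log_trans(prev, tag):
        row = trans[prev]
        # Add-one smoothing so unseen tag bigrams keep a small probability.
        return math.log((row[tag] + 1) / (sum(row.values()) + len(tags)))

    best = {t: (log_trans("<s>", t) + log_emission(model, words[0], t), [t])
            for t in tags}
    for word in words[1:]:
        step = {}
        for t in tags:
            score, path = max(((best[p][0] + log_trans(p, t), best[p][1])
                               for p in tags), key=lambda x: x[0])
            step[t] = (score + log_emission(model, word, t), path + [t])
        best = step
    return max(best.values(), key=lambda x: x[0])[1]

corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train(corpus)
print(viterbi(model, ["the", "zebra", "sleeps"]))  # ['DET', 'NOUN', 'VERB']
```

Here "zebra" never occurs in the toy corpus, so its tag is decided by the fallback emission estimate together with the tag transitions. As in the abstract, the only resource used is a tagged corpus.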