李国臣. 文本分类中基于对数似然比测试的特征词选择方法[J]. 中文信息学报, 1999, 13(4): 17-22.
Li Guochen. A Log-Likelihood-Ratio-Test-Based Feature Word Selection Approach in Text Categorization. Journal of Chinese Information Processing, 1999, 13(4): 17-22.
文本分类中基于对数似然比测试的特征词选择方法
李国臣
山西大学计算机科学系
A Log-Likelihood-Ratio-Test-Based Feature Word Selection Approach in Text Categorization
Li Guochen
Department of Computer Science, Shanxi University
Abstract: This paper applies a log-likelihood-ratio-test-based feature word selection approach to text categorization. The traditional method conducts the frequency test, the salience test, and the distribution test independently of one another. The proposed approach instead uses a covariance matrix to coordinate the associations among these statistics, so that all of them are integrated into a single test. Experiments show that the proposed approach is superior to the traditional approach.
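The abstract does not give the formulation, so the following is only a minimal sketch of one plausible reading. It assumes each candidate word is described by a vector of three statistics (frequency, salience, distribution), and that feature words and non-feature words are each modeled by a multivariate Gaussian with a full covariance matrix. The function names, the Gaussian assumption, and all numbers are hypothetical illustrations, not taken from the paper.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a must be symmetric positive-definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal with a full covariance matrix."""
    n = len(x)
    L = cholesky(cov)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution solves L z = diff; z . z is the squared Mahalanobis distance.
    z = []
    for i in range(n):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((diff[i] - s) / L[i][i])
    mahalanobis_sq = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + mahalanobis_sq)


def log_likelihood_ratio(x, feature_model, background_model):
    """Positive values favor the 'feature word' model, negative the background model."""
    return gaussian_logpdf(x, *feature_model) - gaussian_logpdf(x, *background_model)


# Hypothetical models: (mean vector, covariance matrix) over
# (frequency, salience, distribution). The off-diagonal covariance terms are
# what let the three statistics be judged jointly rather than independently.
shared_cov = [
    [0.04, 0.01, 0.00],
    [0.01, 0.04, 0.01],
    [0.00, 0.01, 0.04],
]
feature_model = ([0.8, 0.7, 0.6], shared_cov)
background_model = ([0.2, 0.2, 0.3], shared_cov)

candidate = [0.75, 0.70, 0.55]  # illustrative statistics for one candidate word
is_feature_word = log_likelihood_ratio(candidate, feature_model, background_model) > 0.0
```

Under these assumptions, a word is selected when its joint log-likelihood ratio exceeds a threshold (zero here), in place of passing three separate thresholds on the individual statistics.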