Abstract
This paper applies the log-likelihood ratio test to feature word selection for text categorization. Traditional methods run the frequency, concentration, and dispersion tests independently of one another. The proposed approach instead uses a covariance matrix to coordinate the relationships among these statistics, so that they are integrated into a single unified whole. Experiments show that this feature word selection approach outperforms the traditional approach in which the frequency, concentration, and dispersion tests are conducted independently.
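The abstract does not give the exact formulation, so the following is only a minimal sketch of one way a covariance matrix can tie several per-word statistics into a single log-likelihood ratio score. It assumes each word is described by a three-element vector (frequency, concentration, dispersion) and that feature words and non-feature words are each modeled by a multivariate Gaussian with a full covariance matrix. The Gaussian assumption, the function names, and the synthetic data are all illustrative choices, not details taken from the paper.

```python
import numpy as np


def fit_gaussian(samples):
    """Estimate the mean vector and full covariance matrix of the rows."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    return mean, cov


def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    # Mahalanobis distance: the covariance matrix couples the statistics.
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)


def log_likelihood_ratio(x, feature_model, background_model):
    """Positive values favor the feature-word model."""
    return log_gaussian(x, *feature_model) - log_gaussian(x, *background_model)


# Synthetic (hypothetical) statistics: columns are frequency,
# concentration, and dispersion scores for each word.
rng = np.random.default_rng(0)
feature_cov = np.array([[1.0, 0.6, 0.3],
                        [0.6, 1.0, 0.2],
                        [0.3, 0.2, 1.0]])
feature_words = rng.multivariate_normal([5.0, 5.0, 5.0], feature_cov, size=200)
background_words = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=200)

feature_model = fit_gaussian(feature_words)
background_model = fit_gaussian(background_words)

# Score two candidate words; keep those whose ratio exceeds a threshold.
candidates = np.array([[5.0, 5.0, 5.0], [0.0, 0.0, 0.0]])
scores = [log_likelihood_ratio(c, feature_model, background_model)
          for c in candidates]
selected = [i for i, s in enumerate(scores) if s > 0.0]
```

In this sketch a word is kept when its score is above zero. The three statistics are never thresholded separately, because the off-diagonal covariance terms let a high value in one statistic be weighed against the others. A real system would need labeled training words and a threshold tuned on held-out data.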
Key words
Text categorization /
Feature selection /
Log-likelihood ratio test