Research on Effects of Term Weighting Factors for Text Categorization

ZHANG Aihua, JING Hongfang, WANG Bin, XU Yan

Journal of Chinese Information Processing, 2010, Vol. 24, Issue 3: 97-105.
Review


Abstract

In traditional vector-space text categorization, term weighting is completely decoupled from feature selection: the score that a feature selection function assigns to a term reflects the term's importance, yet it is not carried into the term's weight, which makes the feature representation imprecise and hurts classification performance. Some improved methods modify the TFIDF model with feature selection functions and raise performance, but they do not examine how each weighting factor affects classification. This paper takes term frequency (TF), inverse document frequency (IDF), and feature selection functions as the factors that measure a term's ability to represent a document, to distinguish documents, and to distinguish categories, respectively, and tests their effect on classification performance experimentally. The results show that the document-representation factor (TF) yields the highest peak performance but is sensitive to noisy features; the document-discrimination factor (IDF) is robust to noise but unstable; and the category-discrimination factor (the feature selection function) has the strongest noise tolerance and the most stable performance. Finally, four criteria are proposed for combining these factors into weighting schemes, and experiments verify that they improve classification performance.
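The three factors named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: IDF is taken as log(N/df), the chi-square statistic stands in for the feature selection function, and the three factors are simply multiplied. This is not the paper's weighting scheme and does not implement its four criteria; the function names are hypothetical.

```python
import math
from collections import Counter


def idf(docs):
    """Document-discrimination factor: log(N / df) for every corpus term."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def chi_square(docs, labels, term, category):
    """Chi-square statistic of the 2x2 term/category contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document outside category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document outside category
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def category_score(docs, labels, term):
    """Category-discrimination factor: maximum chi-square over categories."""
    return max(chi_square(docs, labels, term, cat) for cat in set(labels))


def weight_document(doc, idf_table, category_table):
    """Assumed combination: TF * IDF * feature-selection score per term."""
    tf = Counter(doc)  # document-representation factor
    return {t: tf[t] * idf_table[t] * category_table[t] for t in tf}


# Toy corpus (made up for illustration).
docs = [
    ["goal", "match", "the"],
    ["match", "team", "the"],
    ["stock", "market", "the"],
    ["market", "price", "the"],
]
labels = ["sports", "sports", "finance", "finance"]

idf_table = idf(docs)
vocabulary = {term for doc in docs for term in doc}
category_table = {t: category_score(docs, labels, t) for t in vocabulary}
weights = weight_document(docs[0], idf_table, category_table)
```

In this toy corpus "the" occurs in every document, so both its IDF and its chi-square score are zero and it receives zero weight, whereas "match" occurs only in the sports documents and receives the largest weight in the first document.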
Key words: computer application; Chinese information processing; text categorization; term weighting; effects of weighting factors; VSM


Cite this article

ZHANG Aihua, JING Hongfang, WANG Bin, XU Yan. Research on Effects of Term Weighting Factors for Text Categorization. Journal of Chinese Information Processing. 2010, 24(3): 97-105


Funding

National Natural Science Foundation of China (60873166); National 973 Program of China (2007CB311103); National 863 Program of China (2006AA010105)