An Improved Chinese Text Classification Algorithm Based on Multiple Feature Factors

YE Min, TANG Shiping, NIU Zhendong

Journal of Chinese Information Processing, 2017, Vol. 31, Issue 4: 132-137.
Information Extraction and Text Mining

Abstract

Web page texts are represented with the vector space model (VSM). Four feature factors (frequency, concentration, dispersion, and position) are introduced into the chi-square (CHI) feature selection algorithm, and the TF-IDF weighting formula is improved with term-length and position factors. The result is the PCHI-PTFIDF (promoted CHI-promoted TF-IDF) algorithm for Chinese text classification. The improved algorithm reduces dimensionality to a feature set with stronger classification ability and reflects the weight distribution of feature terms more precisely. Experimental results show that, compared with a text classification algorithm using traditional CHI and traditional TF-IDF, PCHI-PTFIDF improves Macro-F1 by 10% on average.
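The abstract does not give the PCHI or PTFIDF formulas, so the following is only a minimal sketch of the traditional baseline it compares against: the classic chi-square score from a 2x2 term/category contingency table, and plain TF-IDF with a natural-log IDF. The function names, the toy English tokens, and the category labels are hypothetical and chosen for illustration; the frequency, concentration, dispersion, position, and term-length factors from the paper are not modeled here.

```python
import math
from collections import Counter

def chi_square(docs, labels, term, category):
    """Classic chi-square score of `term` for `category` (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1      # term present, document in category
        elif has_term:
            b += 1      # term present, document in another category
        elif in_cat:
            c += 1      # term absent, document in category
        else:
            d += 1      # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def tf_idf(docs):
    """Plain TF-IDF: relative term frequency times ln(N / document frequency)."""
    n = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: (tf[t] / len(tokens)) * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical, already-segmented toy corpus (illustration only).
docs = [
    ["stock", "market", "rise"],
    ["stock", "fund"],
    ["match", "goal"],
    ["goal", "team", "match"],
]
labels = ["finance", "finance", "sports", "sports"]

stock_score = chi_square(docs, labels, "stock", "finance")
rise_score = chi_square(docs, labels, "rise", "finance")
weights = tf_idf(docs)
```

In this toy corpus, "stock" appears in every finance document and in no sports document, so it scores higher than "rise", which appears in only one finance document. A feature selection step would keep the top-scoring terms per category before computing weights.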

Key words

text classification / χ² statistic / feature selection / TF-IDF feature weighting

Cite this article

YE Min, TANG Shiping, NIU Zhendong. An Improved Chinese Text Classification Algorithm Based on Multiple Feature Factors. Journal of Chinese Information Processing. 2017, 31(4): 132-137
