A Feature Selection Method in Chinese Text Classification Based on Concept Extraction with a Shielded Level

LIAO Sha-sha, JIANG Ming-hu

Journal of Chinese Information Processing ›› 2006, Vol. 20 ›› Issue (3): 24-30.

Abstract

This paper proposes a novel feature selection method based on concept extraction and a shielded level. The method uses the concept tree of the HowNet semantic dictionary to extract concept attributes (sememes). Each attribute is weighted according to its position in the concept tree, and the weight expresses its descriptive power. An attribute whose weight falls below the shielded level is not selected as a feature; the original word is retained instead. For each word, we calculate the weights of the concept attributes in its DEF entry and decide whether to extract the concept attributes or keep the word itself. We focus on how to weight the concept attributes, how to balance concept features against word features, and how to weight words that are not in the dictionary. Experiments show that a properly set shielded level not only reduces the feature dimension but also improves classification precision to some extent, and it narrows the gap in classification precision between categories.
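The selection rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy concept tree `PARENT`, the `DEF` entries, the depth-based `weight` function, and the rule "keep every attribute at or above the shielded level, otherwise fall back to the word" are all assumptions made for the example. The abstract does not give the actual weighting formula or HowNet data format.

```python
from typing import Dict, List, Optional

# Hypothetical toy concept tree (child -> parent); None marks the root.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "institution": "thing",
}

# Hypothetical DEF entries: word -> concept attributes in its definition.
DEF: Dict[str, List[str]] = {
    "doctor": ["human"],
    "object": ["thing"],
    "bank": ["institution", "thing"],
}


def depth(attr: str) -> int:
    """Number of edges from the root of the concept tree to attr."""
    d = 0
    while PARENT[attr] is not None:
        attr = PARENT[attr]
        d += 1
    return d


MAX_DEPTH = max(depth(a) for a in PARENT)


def weight(attr: str) -> float:
    """Assumed weighting: deeper (more specific) attributes score higher."""
    return depth(attr) / MAX_DEPTH


def select_features(words: List[str], shielded_level: float) -> List[str]:
    """Replace each word by its unshielded attributes, else keep the word."""
    features: List[str] = []
    for w in words:
        kept = [a for a in DEF.get(w, []) if weight(a) >= shielded_level]
        # Fall back to the original word when it is not in the dictionary
        # or when every attribute lies below the shielded level.
        features.extend(kept if kept else [w])
    return features


print(select_features(["doctor", "object", "bank", "HowNet"], 0.5))
```

In this sketch, raising `shielded_level` keeps more original words and fewer general attributes, while lowering it maps more words onto shared concepts and shrinks the feature set.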

Key words

computer application / Chinese information processing / text classification / feature selection / concept extraction / concept tree / shielded level / description power

Cite this article

LIAO Sha-sha, JIANG Ming-hu. A Feature Selection Method in Chinese Text Classification Based on Concept Extraction with a Shielded Level. Journal of Chinese Information Processing. 2006, 20(3): 24-30


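Funding

Supported by the Excellent Young Teachers Program of the Ministry of Education (2051); the Open Project Fund of the National Laboratory of Pattern Recognition, Chinese Academy of Sciences (10); and the 2003 Tsinghua University 985 Phase I Basic Research Fund.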