LIAO Sha-sha, JIANG Ming-hu. A Feature Selection Method in Chinese Text Classification Based on Concept Extraction with a Shielded Level[J]. Journal of Chinese Information Processing, 2006, 20(3): 24-30.
A Feature Selection Method in Chinese Text Classification Based on Concept Extraction with a Shielded Level
LIAO Sha-sha, JIANG Ming-hu
Lab of Computational Linguistics, School of Humanities, Tsinghua University; Cognitive Science Innovation Base, Tsinghua University
Abstract: In this paper, we propose a novel feature selection method based on concept extraction with a shielded level. The method uses HowNet as the semantic dictionary from which concept attributes are extracted. Each attribute is weighted according to its position in the concept tree, and the weight represents its descriptive power. A concept attribute whose weight is lower than the shielded level is not selected as a feature, and the original word is retained instead. For each word, we calculate the weights of all concept attributes in its DEF and decide whether to extract the concept attributes or retain the word. We focus mainly on how to weight the concept attributes, how to balance concept features against word features, and how to treat words that are not in the dictionary. The experiments show that a properly set shielded level not only reduces the feature dimension to a suitable scale but also improves classification precision. Moreover, it reduces the difference in classification precision among categories.
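
The decision rule described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy concept tree `PARENT`, the toy `DEFS` table, the depth-based weight (depth divided by the maximum depth, so deeper and more specific attributes score higher), and the rule that a word is retained only when none of its attributes reaches the shielded level. The abstract does not give the paper's actual weighting formula, and real DEF entries would come from HowNet rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical toy concept tree: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "institution": "thing",
}

# Hypothetical DEF table: word -> concept attributes (stand-in for HowNet).
DEFS = {
    "teacher": ["human"],
    "school": ["institution"],
    "object": ["thing"],
}


def depth(node, parent=PARENT):
    """Number of edges between a node and the root of the concept tree."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def attribute_weights(parent=PARENT):
    """Assumed weighting: depth normalised by the deepest node in the tree."""
    max_depth = max(depth(n, parent) for n in parent)
    return {n: depth(n, parent) / max_depth for n in parent}


def extract_features(words, shield, defs=DEFS, parent=PARENT):
    """Map words to concept features, or keep the word when shielded."""
    weights = attribute_weights(parent)
    features = Counter()
    for word in words:
        attrs = defs.get(word)
        if attrs is None:
            # Word is not in the dictionary: keep it as a word feature.
            features["word:" + word] += 1
            continue
        # Attributes below the shielded level are not selected.
        strong = [a for a in attrs if weights[a] >= shield]
        if strong:
            for a in strong:
                features["concept:" + a] += 1
        else:
            # No attribute is descriptive enough: retain the original word.
            features["word:" + word] += 1
    return features
```

With a shielded level of 0.5, `extract_features(["teacher", "school", "object", "Tsinghua"], shield=0.5)` maps "teacher" and "school" to concept features, while "object" (whose only attribute is too general) and the out-of-dictionary word "Tsinghua" stay as word features. Raising the shielded level moves more words back to word features, which is the balance between concept and word features that the abstract refers to.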