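Integrating Class Frequency Into Association Rules Based Chinese Text Categorization

QIAN Tie-yun, WANG Yuan-zhen, FENG Xiao-nian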

Journal of Chinese Information Processing ›› 2004, Vol. 18 ›› Issue (6): 31-37.

Abstract

This paper proposes ARCTC, an algorithm that combines term class frequency with association-rule-based Chinese text categorization. The algorithm treats each document as a transaction and each keyword as an item. Exploiting the characteristics of text transactions, it uses the class frequency of a term to filter out words that have little relevance to classification, and then applies an improved association rule mining algorithm to mine the correlations between items and categories. The mined rules are used to form a set of characteristic terms for each class. An unlabeled document is classified by intersecting its term set with each of these class sets and assigning the class whose intersection contains the most elements. Experiments show that the algorithm achieves good recall, precision, and F-measure while reducing both training and test time.
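The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' ARCTC implementation: the abstract does not say how class frequency is thresholded or how the mining algorithm is improved, so the sketch makes several assumptions. Class frequency is taken to be the number of distinct classes a term occurs in, only single-term rules of the form term → class are mined, and support is measured within each class. The function names, the thresholds (`max_class_freq`, `min_support`, `min_confidence`), and the toy corpus `TOY_DOCS` are all hypothetical, and documents are assumed to be word-segmented already.

```python
from collections import Counter, defaultdict


def class_frequency(docs):
    """Map each term to the number of distinct classes it occurs in (assumed definition)."""
    classes_of = defaultdict(set)
    for terms, label in docs:
        for term in set(terms):
            classes_of[term].add(label)
    return {term: len(labels) for term, labels in classes_of.items()}


def mine_class_term_sets(docs, max_class_freq=1, min_support=0.5, min_confidence=0.6):
    """Build one characteristic term set per class from single-term rules term -> class.

    Assumptions: terms occurring in more than max_class_freq classes are dropped,
    support is the fraction of the class's documents containing the term, and
    confidence is the fraction of the term's documents that belong to the class.
    """
    cf = class_frequency(docs)
    class_sizes = Counter(label for _, label in docs)
    term_total = Counter()      # documents containing the term, over all classes
    term_in_class = Counter()   # documents containing the term, per class
    for terms, label in docs:
        for term in set(terms):
            if cf[term] > max_class_freq:
                continue  # class-frequency filter: term is spread over too many classes
            term_total[term] += 1
            term_in_class[(term, label)] += 1

    feature_sets = defaultdict(set)
    for (term, label), count in term_in_class.items():
        support = count / class_sizes[label]
        confidence = count / term_total[term]
        if support >= min_support and confidence >= min_confidence:
            feature_sets[label].add(term)
    return dict(feature_sets)


def classify(terms, feature_sets):
    """Assign the class whose characteristic term set overlaps the document most."""
    overlap = {label: len(set(terms) & fs) for label, fs in feature_sets.items()}
    return max(overlap, key=overlap.get)


# Hypothetical toy corpus: (segmented terms, class label) pairs.
TOY_DOCS = [
    (["stock", "market", "report"], "finance"),
    (["stock", "bank", "report"], "finance"),
    (["match", "goal", "report"], "sports"),
    (["match", "team", "report"], "sports"),
]
```

With this toy corpus, `mine_class_term_sets(TOY_DOCS)` discards "report" because it occurs in both classes, and `classify(["stock", "goal", "bank"], ...)` picks "finance" because two of the document's terms fall in that class's set against one for "sports". Ties are broken by dictionary order here, which is another simplification.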

Key words

computer application / Chinese information processing / association based classification / Chinese text categorization / term class frequency / class character term set

Cite this article

QIAN Tie-yun, WANG Yuan-zhen, FENG Xiao-nian. Integrating Class Frequency Into Association Rules Based Chinese Text Categorization. Journal of Chinese Information Processing. 2004, 18(6): 31-37

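Funding

Supported by the Ministry of Science and Technology project on key technologies and application systems for the science and technology e-government system (2001BA110B01)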