Abstract
This paper proposes ARCTC, an algorithm for Chinese text categorization that combines term class frequency with association-based classification. The algorithm treats each document as a transaction and each keyword as an item. To suit the characteristics of text transactions, it uses the class frequency of a term to filter out words that have little relevance to classification, and then applies an improved association rule mining algorithm to mine the correlations between items and categories. The mined rules are used to form a set of class feature words for each category. An unlabeled document is classified by intersecting its set of words with each of these sets, and the category whose intersection contains the most elements is assigned. Experiments show that the algorithm reduces both training and test time while achieving good recall, precision, and F-measure.
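The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' ARCTC implementation, and several details are assumptions made only for illustration: class frequency is taken to be the number of distinct classes in which a term occurs, rule mining is simplified to single-term rules of the form term → class with hypothetical `min_support` and `min_confidence` thresholds, and the function names and toy data are invented.

```python
from collections import Counter, defaultdict


def class_frequency(docs):
    """Count, for each term, the number of distinct classes it occurs in
    (an assumed definition of class frequency)."""
    classes_of = defaultdict(set)
    for terms, label in docs:
        for term in terms:
            classes_of[term].add(label)
    return {term: len(labels) for term, labels in classes_of.items()}


def build_feature_sets(docs, max_class_freq=1, min_support=2, min_confidence=0.6):
    """Form one feature-word set per class from single-term rules term -> class.
    All three thresholds are hypothetical parameters."""
    cf = class_frequency(docs)
    term_count = Counter()   # documents containing the term
    pair_count = Counter()   # documents of a class containing the term
    for terms, label in docs:
        for term in terms:
            if cf[term] > max_class_freq:
                continue     # term is spread over too many classes, so drop it
            term_count[term] += 1
            pair_count[(term, label)] += 1
    features = defaultdict(set)
    for (term, label), n in pair_count.items():
        if n >= min_support and n / term_count[term] >= min_confidence:
            features[label].add(term)
    return dict(features)


def classify(terms, features):
    """Return the class whose feature set shares the most terms with the
    document, or None when no feature set overlaps it at all."""
    terms = set(terms)
    best = max(features, key=lambda label: len(features[label] & terms))
    return best if features[best] & terms else None


# Toy training data: each document is a set of terms plus a class label.
docs = [
    ({"stock", "market", "the"}, "finance"),
    ({"stock", "bank", "the"}, "finance"),
    ({"match", "goal", "the"}, "sports"),
    ({"goal", "team", "the"}, "sports"),
]
features = build_feature_sets(docs)
label = classify({"goal", "the", "player"}, features)
```

In this toy run the word "the" occurs in both classes and is filtered out by class frequency, while words seen in only one document fail the support threshold, so each class keeps a single feature word. Ties between classes are broken arbitrarily by dictionary order, a choice the abstract does not specify.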
Key words
computer application /
Chinese information processing /
association-based classification /
Chinese text categorization /
term class frequency /
class feature term set
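Funding
Supported by the Ministry of Science and Technology research project on key technologies and application systems for science and technology e-government systems (2001BA110B01)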