A Closeness-Based Semi-Supervised Text Classification Method

ZHENG Hai-qing, LIN Chen, NIU Jun-yu

Journal of Chinese Information Processing, 2007, Vol. 21, Issue 3: 54-60.
Review


Abstract

Automatic text categorization has become an important research topic. In many practical applications, the training corpus contains only a limited set of labeled positive documents alongside a large pool of unlabeled documents, and the numbers of positive and unlabeled documents are usually imbalanced. Such tasks therefore differ from traditional text categorization, where the training set contains both labeled positive and labeled negative examples, and traditional classifiers applied directly to them rarely achieve satisfactory results. This paper proposes a closeness-based method for this semi-supervised text categorization problem. Because no negative documents are labeled, the method first extracts a set of reliable negative examples from the unlabeled set and then uses a closeness measure to expand that set to a suitable size. A classifier is then built from the labeled positive set and the extracted negative set, which improves classification performance. The method does not rely on any special external knowledge base for feature selection, so it can be applied to semi-supervised text categorization tasks in different domains. Experiments on the text categorization task of the TREC’05 (Text REtrieval Conference) Genomics track show that the algorithm performs well on this kind of problem.
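
The abstract outlines a two-stage procedure (extract reliable negatives, then expand them by closeness) but does not define the closeness measure itself. The Python sketch below is therefore only an illustration under stated assumptions, not the authors' algorithm: documents are assumed to be dense term-weight vectors, the seed negatives are assumed to be the unlabeled documents least similar to the positive centroid, and "closeness" is assumed to be cosine similarity to a class centroid. The names `build_negative_set`, `seed_size`, and `target_size` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_negative_set(positive, unlabeled, seed_size, target_size):
    """Return sorted indices of unlabeled documents selected as negatives."""
    pos_c = centroid(positive)

    # Stage 1 (assumed heuristic): seed with the unlabeled documents
    # that are least similar to the positive centroid.
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: cosine(unlabeled[i], pos_c))
    negatives = set(ranked[:seed_size])

    # Stage 2 (assumed closeness rule): grow the set one document at a
    # time, choosing the candidate whose similarity to the negative
    # centroid most exceeds its similarity to the positive centroid.
    while len(negatives) < target_size:
        candidates = [i for i in range(len(unlabeled)) if i not in negatives]
        if not candidates:
            break
        neg_c = centroid([unlabeled[i] for i in negatives])
        best = max(candidates,
                   key=lambda i: cosine(unlabeled[i], neg_c)
                   - cosine(unlabeled[i], pos_c))
        # Stop once the best candidate is no closer to the negatives
        # than to the positives.
        if cosine(unlabeled[best], neg_c) <= cosine(unlabeled[best], pos_c):
            break
        negatives.add(best)
    return sorted(negatives)


# Toy two-term vectors: the first three unlabeled documents resemble
# each other, the last two resemble the positive documents.
positive = [[1.0, 0.0], [0.9, 0.1]]
unlabeled = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [1.0, 0.1], [0.8, 0.2]]
print(build_negative_set(positive, unlabeled, seed_size=1, target_size=3))
```

In a full pipeline, the labeled positive documents and the documents at the returned indices would then serve as the two classes for training a classifier such as a support vector machine, which the keywords list as the learner.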

Key words

computer application / Chinese information processing / text categorization / semi-supervised learning / support vector machine / closeness

Cite this article

ZHENG Hai-qing, LIN Chen, NIU Jun-yu. A Closeness-Based Semi-Supervised Text Classification Method. Journal of Chinese Information Processing. 2007, 21(3): 54-60

Funding

Supported by the National High Technology Research and Development Program of China (863 Program) (2001AA114210-01)