Abstract
The term-based vector space model (VSM) is a traditional approach to representing documents for text categorization. Its main drawback is that it ignores the relations between terms. Document representations based on latent topics, such as latent Dirichlet allocation (LDA), have recently attracted much attention because they can capture these relations. However, a representation based on latent topics alone may lose information carried by the terms themselves. In this paper, we use a modified random forests method to combine the term-based and the LDA topic-based document representations. A random forest is constructed separately for each of the two feature types, and the final classification result is decided by a voting scheme. Experimental results on standard datasets show that, compared with methods that use only one set of text features, our method effectively combines the two kinds of features and improves text categorization performance.
Key words: computer application; Chinese information processing; text categorization; vector space model (VSM); latent Dirichlet allocation; ensemble classification; random forests
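The combination step described in the abstract (one forest per feature type, with the final label decided by voting) can be illustrated with a minimal sketch. The abstract does not specify the exact voting scheme or the modification made to random forests, so this sketch assumes that every tree in both forests casts one equally weighted vote and that ties are broken by label order. The function names, the toy "trees" (plain callables standing in for trained decision trees), and the sample document are hypothetical and serve only as an illustration.

```python
from collections import Counter


def forest_votes(trees, features):
    """Collect one predicted label per tree for a single document."""
    return [tree(features) for tree in trees]


def combined_vote(term_trees, topic_trees, term_features, topic_features):
    """Pool the votes of both forests and return the majority label.

    Assumption: equal weight per tree; ties go to the smallest label
    so that the result is deterministic.
    """
    votes = Counter(forest_votes(term_trees, term_features))
    votes.update(forest_votes(topic_trees, topic_features))
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


# Hypothetical stand-ins for trained trees in the term-based forest.
term_trees = [
    lambda x: "sports" if x.get("match", 0) > 0 else "finance",
    lambda x: "finance" if x.get("stock", 0) > 0 else "sports",
    lambda x: "sports" if x.get("team", 0) > 0 else "finance",
]

# Hypothetical stand-ins for trained trees in the LDA topic-based forest.
topic_trees = [
    lambda t: "sports" if t[0] > 0.5 else "finance",
    lambda t: "finance" if t[1] > 0.3 else "sports",
]

doc_terms = {"match": 1.0, "stock": 0.5}  # term weights of one document
doc_topics = [0.7, 0.2]                   # topic proportions of the same document

term_only = Counter(forest_votes(term_trees, doc_terms)).most_common(1)[0][0]
label = combined_vote(term_trees, topic_trees, doc_terms, doc_topics)
print(term_only, label)
```

In this toy case the term-based forest alone votes 2 to 1 for one class, while the pooled vote of both forests selects the other, which is the kind of correction that combining the two representations is meant to allow.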
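Funding
Supported by the National 863 Program of China (2006AA010109); the National Natural Science Foundation of China (60673043); the National Social Science Fund of China (07BYY051)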