Microblog Topic Detection Based on Two-Stage Clustering of Thread Trees

MA Bin, HONG Yu, LU Jianjiang, YAO Jianmin, ZHU Qiaoming

Journal of Chinese Information Processing ›› 2012, Vol. 26 ›› Issue (6): 121-129.
Review

A Thread-based Two-stage Clustering Method of Microblog Topic Detection

  • MA Bin, HONG Yu, LU Jianjiang, YAO Jianmin, ZHU Qiaoming

Abstract

Microblogging is a new mode of publishing information that greatly increases the openness and interactivity of online information, but it has also caused the volume of information in the microblog space to grow explosively. Topic detection techniques classify and organize microblog texts by topic, which helps users efficiently find personalized information or hot topics in a dynamically changing information environment. Microblog texts are short, semi-structured, and rich in contextual information. To address these characteristics, this paper proposes a topic detection method based on two-stage clustering over thread trees. In the first stage, a topic model that incorporates temporal features and author information (Temporal-Author-Topic, TAT) performs local clustering within each thread tree, which filters out noisy (spam) microblogs. In the second stage, the microblog texts remaining in each thread are merged into a single thread text, and global topic detection is run over the merged thread texts. Experimental results show that the method is effective at mitigating data sparsity and achieves an F-measure of 31.2% for topic detection.
Key words

microblog texts / topic detection / TAT model / thread information / LDA feature selection
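
The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. This page does not give the details of the TAT model, so the sketch does not implement it. As a stated substitution, the local step keeps only replies whose bag-of-words cosine similarity to the thread's root post passes a threshold, and the global step uses simple single-pass centroid clustering. All function names, thresholds, and the sample data are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Whitespace-tokenized term counts (a stand-in for real segmentation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def clean_thread(posts, min_sim=0.2):
    """Stage 1 (local): keep the root post and the replies similar to it.

    Assumption: similarity to the root replaces the paper's TAT-based
    local clustering, which is not specified on this page.
    """
    root = bag_of_words(posts[0])
    kept = [p for p in posts[1:] if cosine(root, bag_of_words(p)) >= min_sim]
    return [posts[0]] + kept


def cluster_threads(thread_texts, min_sim=0.3):
    """Stage 2 (global): single-pass clustering of merged thread texts."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [int]}
    for i, text in enumerate(thread_texts):
        vec = bag_of_words(text)
        best = max(
            clusters, key=lambda c: cosine(c["centroid"], vec), default=None
        )
        if best is not None and cosine(best["centroid"], vec) >= min_sim:
            best["members"].append(i)
            best["centroid"].update(vec)  # accumulate counts into the centroid
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


def detect_topics(threads):
    """Clean each thread, merge it into one text, then cluster globally."""
    merged = [" ".join(clean_thread(posts)) for posts in threads]
    return cluster_threads(merged)


# Hypothetical sample data: each thread is a root post followed by replies.
example_threads = [
    ["earthquake hits city center",
     "earthquake damage in city is severe",
     "buy cheap watches now"],
    ["rescue teams reach earthquake city",
     "rescue teams help earthquake victims"],
    ["football final tonight",
     "who wins the football final"],
]

if __name__ == "__main__":
    print(detect_topics(example_threads))
```

Merging a cleaned thread into one longer text before the global step is what eases the sparsity of individual short posts: two threads about the same event share more terms after merging than any two of their single posts would.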

Cite this article

MA Bin, HONG Yu, LU Jianjiang, YAO Jianmin, ZHU Qiaoming. A Thread-based Two-stage Clustering Method of Microblog Topic Detection. Journal of Chinese Information Processing. 2012, 26(6): 121-129

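Funding

National Natural Science Foundation of China (61003152, 60970057, 60970056); Doctoral Program Foundation of the Ministry of Education (2009321110006); Specialized Research Fund for the Doctoral Program of the Ministry of Education (20103201110021); Natural Science Foundation of Suzhou, Jiangsu Province (SYG201030)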