Abstract
Microblogging, a new mode of publishing information, has made online information far more open and interactive, but it has also caused the volume of information in the microblog space to grow explosively. Organizing microblog texts by topic with topic detection techniques helps users efficiently find personally relevant information or hot topics in a dynamically changing environment. To handle microblog texts, which are short, semi-structured, and rich in contextual information, we propose a thread-tree-based two-level clustering method for topic detection. In the first phase, a topic model that incorporates temporal features and author information (Temporal-Author-Topic, TAT) performs local clustering within each thread tree, which filters out noisy (spam) microblog texts. In the second phase, the microblog texts remaining in each cleaned thread tree are merged into a single thread text, and global topic detection is run over these thread texts. Experimental results show that the method is effective at alleviating data sparsity and achieves a topic detection F-measure of 31.2%.
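The two-phase flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the TAT model is replaced here by a plain bag-of-words cosine filter against the root post of each thread, and global topic detection is approximated by greedy single-pass clustering. All function names, the thresholds `min_sim`, and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens as a term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def clean_thread(posts, min_sim=0.2):
    """Phase 1 (stand-in for TAT, an assumption): keep the root post and
    only those replies whose similarity to the root reaches min_sim."""
    root = bag_of_words(posts[0])
    kept = [p for p in posts[1:] if cosine(root, bag_of_words(p)) >= min_sim]
    return [posts[0]] + kept


def merge_thread(posts):
    """Concatenate the surviving posts of one thread into one thread text."""
    return " ".join(posts)


def detect_topics(thread_texts, min_sim=0.3):
    """Phase 2: greedy single-pass clustering over thread texts.

    A thread joins the most similar existing cluster when the similarity
    to that cluster's summed term vector reaches min_sim; otherwise it
    starts a new cluster. Returns clusters as lists of thread indices.
    """
    clusters = []  # each entry: (summed term vector, member indices)
    for i, text in enumerate(thread_texts):
        vec = bag_of_words(text)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(cluster[0], vec)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= min_sim:
            best[0].update(vec)   # add this thread's counts to the cluster
            best[1].append(i)
        else:
            clusters.append((Counter(vec), [i]))
    return [members for _, members in clusters]


# Toy threads (hypothetical data): the first post of each list is the root.
threads = [
    ["earthquake hits the coast today",
     "the earthquake damaged the coast road",
     "buy cheap watches online now"],
    ["coast earthquake rescue teams arrive",
     "rescue teams reach the earthquake zone"],
    ["new phone launch event tonight",
     "the phone launch event starts tonight"],
]

cleaned = [clean_thread(t) for t in threads]
topics = detect_topics([merge_thread(t) for t in cleaned])
```

Merging each cleaned thread into one longer text before clustering is what addresses the sparsity of individual short posts: the global step compares thread-level vectors rather than single messages.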
Key words: microblog texts; topic detection; TAT model; thread information; LDA feature selection
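Funding
Supported by the National Natural Science Foundation of China (61003152, 60970057, 60970056); the Doctoral Program Foundation of the Ministry of Education (2009321110006); the Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20103201110021); and the Natural Science Foundation of Suzhou, Jiangsu Province (SYG201030).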