Abstract: Microblogging is a novel model of individual publishing on the Internet that makes significantly more information open and interactive. Using topic detection techniques to classify and organize microblog texts by topic helps users reach the information that interests them in this dynamic environment. To deal with short, semi-structured, context-dependent microblog texts, we propose a thread-based two-stage clustering method. In the first phase, the temporal-author-topic (TAT) model is applied to clean each thread, namely to filter out noisy microblog texts. In the second phase, the microblog texts within each thread are merged to form thread texts for global topic detection. Experimental results show that the approach achieves an F-measure of 31.2%.
Key words: microblog texts; topic detection; TAT model; thread information; LDA feature selection