HE Min¹; LIU Wei¹,²; LIU Yue¹; WANG Lihong²; BAI Shuo¹; CHENG Xueqi¹
1. CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China
Abstract: Microblog data are sparse, and the relations between posts are hard to determine from content alone. To address this, a feature-driven microblog topic detection method is proposed. Meaningful strings are extracted as dynamic microblog features. The author influence and document influence of each feature are defined from the structural relations among microblogs and, together with content statistics, form the feature's attribute set. A logistic regression model classifies the features into key features and noise features. A modified nearest neighbor clustering method then derives topics by clustering the key features, using the mutual information between key features as the distance measure. Experiments on microblog data show that the proposed method remarkably improves accuracy and recall.
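The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the key features have already been selected by the classifier, uses pointwise mutual information over post-level co-occurrence as the similarity, and uses a plain single-pass nearest neighbor rule with a hypothetical `threshold` parameter. The function names `pmi_table` and `cluster_features` are illustrative only.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(posts, features):
    """Pointwise mutual information for every feature pair that co-occurs.

    `posts` is a list of sets, each holding the features found in one post
    (an assumed input format).
    """
    n = len(posts)
    df = Counter()  # number of posts containing each feature
    co = Counter()  # number of posts containing both features of a pair
    for post in posts:
        present = sorted(f for f in features if f in post)
        df.update(present)
        co.update(combinations(present, 2))
    return {
        (a, b): math.log((c / n) / ((df[a] / n) * (df[b] / n)))
        for (a, b), c in co.items()
    }


def cluster_features(features, pmi, threshold=0.0):
    """Single-pass nearest neighbor clustering with PMI as the similarity."""
    clusters = []
    for f in features:
        best, best_score = None, threshold
        for cluster in clusters:
            # Similarity to a cluster: highest PMI with any of its members.
            # Pairs that never co-occur are treated as infinitely dissimilar.
            score = max(
                pmi.get(tuple(sorted((f, g))), float("-inf")) for g in cluster
            )
            if score > best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([f])  # no cluster is close enough: start a topic
        else:
            best.append(f)
    return clusters


# Toy data (invented for illustration): two unrelated topics.
posts = [
    {"earthquake", "rescue"},
    {"earthquake", "rescue"},
    {"concert", "tickets"},
    {"concert", "tickets"},
]
features = ["earthquake", "rescue", "concert", "tickets"]
topics = cluster_features(features, pmi_table(posts, features))
print(topics)
```

In this sketch each resulting cluster of key features stands for one candidate topic. Raising `threshold` yields more, tighter clusters, while lowering it merges loosely associated features.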