Feature Driven Microblog Topic Detection

HE Min; LIU Wei; LIU Yue; WANG Lihong; BAI Shuo; CHENG Xueqi

Journal of Chinese Information Processing, 2017, Vol. 31, Issue 3: 101-108.
Information Extraction and Text Mining

Abstract

Microblog data are sparse, and relations between posts are hard to compute from content alone. To address this, the paper proposes a feature-driven microblog topic detection method. Meaningful strings are extracted as dynamic microblog features. The author influence and document influence of each feature are computed from the structural relations among microblogs and, together with content statistics, form the feature's attribute set. A logistic regression model over these attribute sets classifies features into topic key features and noise features. A modified nearest-neighbor clustering method, which uses the mutual information between key features as the distance measure, then clusters the key features into topics. Experiments on microblog data show that the method effectively improves the accuracy and recall of microblog topic detection.
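
The two core steps named in the abstract, classifying features with logistic regression and clustering key features by mutual information, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three attribute values per feature, the toy feature names, the zero PMI threshold, and the plain single-pass nearest-neighbor rule are all hypothetical. The paper's own modification to nearest-neighbor clustering and its meaningful-string extraction are not reproduced here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def is_key_feature(x, w, b):
    """Binary decision: key feature if the predicted probability is at least 0.5."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5


def pmi(a, b, posts):
    """Pointwise mutual information of two features over posts (sets of features)."""
    n = len(posts)
    n_a = sum(1 for p in posts if a in p)
    n_b = sum(1 for p in posts if b in p)
    n_ab = sum(1 for p in posts if a in p and b in p)
    if n_ab == 0:
        return float("-inf")  # the two features never co-occur
    return math.log((n_ab * n) / (n_a * n_b))


def cluster_features(features, posts, threshold=0.0):
    """Single-pass nearest-neighbor clustering (assumed variant).

    Each feature joins the cluster containing its most similar member if that
    PMI exceeds the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for f in features:
        best, best_sim = None, threshold
        for c in clusters:
            sim = max(pmi(f, g, posts) for g in c)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([f])
        else:
            best.append(f)
    return clusters


if __name__ == "__main__":
    # Hypothetical attribute sets: [author influence, document influence, content statistic]
    attrs = {
        "earthquake": [0.9, 0.8, 0.7], "rescue": [0.8, 0.9, 0.9],
        "final": [0.7, 0.9, 0.8], "goal": [0.9, 0.7, 0.8],
        "haha": [0.1, 0.2, 0.1], "today": [0.2, 0.1, 0.3],
    }
    labels = [1, 1, 1, 1, 0, 0]  # 1 = key feature, 0 = noise feature
    w, b = train_logistic(list(attrs.values()), labels)
    key = [f for f, x in attrs.items() if is_key_feature(x, w, b)]
    posts = [{"earthquake", "rescue"}, {"earthquake", "rescue"},
             {"final", "goal"}, {"final", "goal"}, {"earthquake"}, {"goal"}]
    print(cluster_features(key, posts))
```

Each resulting cluster of key features stands for one candidate topic. In practice the classifier would be trained on labeled features from a separate period rather than on the features it is asked to classify, and the threshold would be tuned on held-out data.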

Key words

topic detection / microblog / key feature / logistic regression / clustering

Cite this article

HE Min; LIU Wei; LIU Yue; WANG Lihong; BAI Shuo; CHENG Xueqi. Feature Driven Microblog Topic Detection. Journal of Chinese Information Processing. 2017, 31(3): 101-108


Funding

National Science and Technology Support Program of China (2012BAH46B01); National Natural Science Foundation of China (61170230)