Abstract
With the spread of computers and the rapid growth of the Internet, social media such as Facebook, Twitter, and Sina Weibo have become the main channels through which people exchange information. However, because social media content is vast in volume, complex in structure, and fast to spread, users cannot quickly and accurately find the information they want. Topic detection and tracking technology emerged in response: it selects the information users care about from large amounts of unordered content, filters and integrates it, produces simple and clear topic information, and on that basis tracks topics and analyzes how they develop. This paper surveys work on topic detection and tracking in social media. It first discusses three classes of topic detection methods: those based on topic models, those based on improved clustering algorithms, and those based on multi-feature fusion. It then introduces research on topic tracking, which falls into two main categories, non-adaptive topic tracking and adaptive topic tracking. Finally, it lists open problems in social media topic detection and tracking and discusses prospects for future research.
Key words
topic detection /
topic tracking /
clustering /
topic model
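Funding
National Natural Science Foundation of China (61772081, 61602044); Science and Technology Innovation Service Capacity Building, Research Base Construction, Beijing Laboratory: Beijing Laboratory of National Economic Security Early Warning Engineering project (PXM2018_014224_000010)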