A Survey of Web-oriented Storyline Mining

ZHAO Xujian, WANG Chongwei, JIN Peiquan, ZHANG Hui, YANG Chunming, LI Bo

Journal of Chinese Information Processing, 2021, Vol. 35, Issue 11: 13-33.
Survey


Abstract

In the Internet era, the sheer volume and complexity of Web information make it difficult for people to quickly and accurately grasp the storyline of a news event. Automatically mining the storylines of social events from Web information ("storyline mining" for short) has therefore become a research hotspot in Web data mining in recent years. Storyline mining aims to extract the evolutionary stages of an event, and further to discover its evolutionary patterns, by analyzing the relationships between a news event and its subsequent related events. It can be applied in many scenarios, such as Web news retrieval, text summarization, and public opinion monitoring, and is of significant research value. This paper first outlines the definition, process, and main tasks of storyline mining. It then reviews in detail the main advances in the field from two perspectives: storyline construction and event evolution analysis. Next, it compares two types of datasets and their evaluation metrics. Finally, it presents several future research challenges and technical frameworks for storyline mining.

Key words

storyline / event evolution / evolutionary cycle / evolutionary pattern

Cite this article

ZHAO Xujian, WANG Chongwei, JIN Peiquan, ZHANG Hui, YANG Chunming, LI Bo. A Survey of Web-oriented Storyline Mining. Journal of Chinese Information Processing. 2021, 35(11): 13-33


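Funding

Humanities and Social Sciences Foundation of the Ministry of Education of China (17YJCZH260); National Natural Science Foundation of China (61672479); Key Project of the Science and Technology Department of Sichuan Province (2020YFS0057); CERNET Next Generation Internet Technology Innovation Project (NGII20180403)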