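A large volume of information appears on Web forums every day, and users can learn from forums about breaking events currently unfolding at home and abroad. Automatically detecting outburst topics in forums has therefore become a basic task for search engines and Web mining systems. Topic detection and tracking (TDT) models handle topic discovery well, but TDT operates on news corpora, which are more accurate, rigorous, and standardized than forum content, so the methods used in TDT are not suited to the casual language of forums. Topic detection in the noisy environment of Web forums therefore faces certain difficulties and challenges. This paper proposes a topic discovery model based on noise filtering that detects forum topics from two angles: content and user participation. Experiments were conducted on the "水木特快" board of "水木社区" (ShuiMu community); the results show that the proposed model can detect not only outburst topics but also the user communities corresponding to those topics.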
Abstract
Web forums have become an important resource on the Web because of the rich information contributed by millions of Internet users every day. Consequently, outburst topic detection has become a fundamental task in search engine and Web mining systems. Most existing topic detection and tracking (TDT) methods deal with news stories and have proved unsuitable for extracting topics from the casual, oral, and informal language of noisy Web forums. This paper presents a noise-filtered model that extracts outburst topics from Web forums using terms and user participation. The proposed model employs not only content similarity but also user participation information. Experiments on the ShuiMu community demonstrate the effectiveness of the proposed model: it not only extracts outburst topics, which are better organized for search and visualization, but also discovers the communities corresponding to these topics.
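As a rough illustration of scoring two forum threads by both content and user participation, the sketch below combines a term-based cosine similarity with a Jaccard overlap of participants. The abstract does not give the model's actual formulas, so everything here is an assumption made for illustration: the function names, the `terms`/`users` thread layout, the choice of cosine and Jaccard measures, and the mixing weight `alpha`.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two bags of terms (raw term counts)."""
    a, b = Counter(terms_a), Counter(terms_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def jaccard(users_a, users_b):
    """Jaccard overlap between the sets of users posting in two threads."""
    a, b = set(users_a), set(users_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def thread_similarity(thread_a, thread_b, alpha=0.5):
    """Hypothetical combined score: alpha weights content, 1 - alpha weights participation."""
    content = cosine_similarity(thread_a["terms"], thread_b["terms"])
    participation = jaccard(thread_a["users"], thread_b["users"])
    return alpha * content + (1 - alpha) * participation


if __name__ == "__main__":
    # Toy threads; the terms and user IDs are made up for the example.
    t1 = {"terms": ["earthquake", "rescue", "rescue"], "users": ["u1", "u2", "u3"]}
    t2 = {"terms": ["earthquake", "donation"], "users": ["u2", "u3", "u4"]}
    print(thread_similarity(t1, t2))
```

Under these assumptions, threads that share few terms can still be grouped when largely the same users take part in them, which is one way participation information could compensate for noisy, informal wording.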
Key words: computer application; Chinese information processing; outburst topic; web forum; time sequence
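Funding
Supported by the Key Program of the National Natural Science Foundation of China (60933005); the National Basic Research Program of China (973 Program) (2007CB311100); the National High Technology Research and Development Program of China (863 Program) (2007AA01Z438)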