Abstract
Topic tracking (TT), a subtask of Topic Detection and Tracking (TDT), is an intelligent information acquisition technology for automatically following the dynamic development of events. Its goal is to automatically find, in a stream of news reports, new stories related to a known topic. By analyzing the shortcomings of the traditional document vector space model and taking the characteristics of news reports into account, this paper proposes a three-dimensional document vector model that emphasizes the theme and entities of news stories. On this basis, we build a topic model suited to the characteristics of news reports, which learns and corrects itself during tracking as the event develops. Combining this topic model with an adaptive KNN news topic tracker that we also design yields a complete Chinese topic tracking model. Experimental results indicate that the approach has certain advantages in describing news topics and avoiding topic drift, and achieves good performance in Chinese topic tracking.
Key words: topic tracking; topic model; 3-dimensional document vector model; adaptive KNN
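The abstract does not give the details of the three-dimensional document vector or of the adaptive KNN tracker, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. The component names (`theme`, `entity`, `content`), their weights, the value of `k`, and the two thresholds `decide` and `adapt` are all hypothetical. A story is represented by three sparse term vectors, similarity is a weighted sum of per-component cosine similarities, and a story that scores confidently on-topic is added to the labeled samples so that the topic representation follows the event as it develops.

```python
import math
from collections import Counter

# Hypothetical component names and weights; the abstract does not specify them.
WEIGHTS = {"theme": 0.4, "entity": 0.4, "content": 0.2}


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def make_doc(theme, entity, content):
    """Build a three-component document from three token lists."""
    return {
        "theme": Counter(theme),
        "entity": Counter(entity),
        "content": Counter(content),
    }


def similarity(d1, d2):
    """Weighted sum of the per-component cosine similarities."""
    return sum(w * cosine(d1[c], d2[c]) for c, w in WEIGHTS.items())


class AdaptiveKNNTracker:
    """kNN tracker that adds confidently on-topic stories to its samples."""

    def __init__(self, on_topic, off_topic, k=3, decide=0.0, adapt=0.3):
        # Label +1 marks on-topic samples, -1 marks off-topic samples.
        self.samples = [(d, 1) for d in on_topic] + [(d, -1) for d in off_topic]
        self.k = k
        self.decide = decide  # assumed decision threshold
        self.adapt = adapt    # assumed (stricter) self-learning threshold

    def score(self, doc):
        """Mean signed similarity over the k most similar stored samples."""
        scored = [(similarity(doc, d), label) for d, label in self.samples]
        nearest = sorted(scored, key=lambda x: x[0], reverse=True)[: self.k]
        if not nearest:
            return 0.0
        return sum(sim * label for sim, label in nearest) / len(nearest)

    def track(self, doc):
        """Return True if doc is judged on-topic; adapt when confident."""
        s = self.score(doc)
        if s >= self.adapt:
            # Self-learning step: keep the story as a new on-topic sample.
            self.samples.append((doc, 1))
        return s > self.decide


# Toy data, invented purely for illustration.
seed = make_doc(["earthquake"], ["Sichuan"], ["earthquake", "rescue", "Sichuan"])
noise = make_doc(["football"], ["Beijing"], ["football", "match", "league"])
follow_up = make_doc(["earthquake"], ["Sichuan"], ["rescue", "teams", "arrive"])
unrelated = make_doc(["football"], ["Beijing"], ["league", "final"])
tracker = AdaptiveKNNTracker([seed], [noise], k=2)
```

Calling `tracker.track(follow_up)` judges the follow-up story on-topic and stores it as a new positive sample, while `tracker.track(unrelated)` rejects the football story and leaves the samples unchanged. Using a stricter threshold for adaptation than for the decision is one common way to limit topic drift, since only high-confidence stories are allowed to reshape the topic.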
Funding
National Science and Technology Infrastructure Platform Construction Fund (2005DKA63901)