Abstract
New topics continually emerge and old ones decay, and the content of a topic also changes over time. Automatically detecting how topics evolve has attracted growing attention. Latent Dirichlet Allocation (LDA), a probabilistic topic model proposed in recent years, has been widely applied to topic evolution. This paper identifies two aspects of topic evolution: content evolution and intensity evolution. It surveys LDA-based topic evolution methods and, according to how time is introduced, divides them into three categories: incorporating time information into the LDA model, post-discretizing the document collection, and pre-discretizing it. After describing the three approaches in detail, the paper compares them on several features, such as time granularity and whether they operate online or offline, and briefly describes the evaluation methods widely used for topic evolution. Finally, it analyzes the open challenges and outlines prospects for this research direction.
Key words: topic model; topic evolution; Latent Dirichlet Allocation
Funding
Supported by the National Natural Science Foundation of China (60873134)