Abstract:Automatic extraction of semantic information and its evolution from large-scale corpus has appealed to many experts and scholars in recent years. Topics are regarded as the latent semantic meanings underlying the document collectionand the topic evolution describes the contents of topics changing over time. This paper proposes a novel extraction method for the topics evolution and the topic relations based on the topic context. Since a topic often co-occurs with other topics in the same document, the co-occurrence information is defined as the context of a topic. Topics with its context are used not only to calculate the semantic relations among topics in the same period, but also to identify the same topics across different time periods. The experiments on NPC&CPPCC news reports from 2008 to 2012 and NIPS scientific literature from 2007 to 2011 have shown that the method has not only improved the results of topic evolution but also mined semantic relations among topics.
[1] Steyvers M, Griffiths T. Probabilistic Topic Models. In: T. Landauer, D. S. McNamara, S. Dennis, W. Kintsch(Eds.), handbook of Latent Semantic Analysys[M]. Hillsdale, NJ. Erlbaum. 2007. [2] Thomas H. Probabilistic Latent Semantic Indexing// Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Berkeley, CA, USA, 1999: 50-57. [3] David M B, Andrew Y N, Michael I J. Latent Dirichlet Allocation. The Journal of Machine Learning Research, 2003, 3: 993-1022. [4] 单斌, 李芳. 基于LDA话题演化研究方法综述. 中文信息学报, 2010,24(6):43-49. [5] Michal R Z, Thomas G, Mark S, et al. The Author Topic Model for Authors and Documents[C]//Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada,2005. [6] David M B, John D L. Dynamic Topic Models[C]//Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, 2006: 113-120. [7] Ali D, Li Juanzi, Zhou Lizhu, et al. A Generalized Topic Modeling Approach for Maven Search. APWeb/WAIM 2009, LNCS 5446, 2009: 138-149. [8] Chenghua Lin, Yulan He. Joint Sentiment/Topic Model for Sentiment Analysis[C]//Proceedings of the CIKM’09, Hong Kong, China, 2009. [9] R. Nallapati, A Ahmed, E P Xing. Joint Latent Topic Models for Text and Citations[C]//Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, Nevada, USA, 2008: 542-550. [10] Chong Wang, David M, David H. Continuous Time Dynamic Topic Models[C]//Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, 2008. [11] Andre G, Alexander H. Topic evolution in a stream of documents[C]//Proceedings of the Ninth SIAM International Conference on Data Mining, 2009: 859-870. [12] Mei Qiaozhu, Zhai Chengxiang. Discovering Evolutionary Theme Patterns from Text—An Exploration of Temporal Text Mining[C]//Proceedings of the KDD’05, Chicago, Illinois, USA,2005. [13] 楚克明, 李芳. 基于LDA话题关联的话题演化.交大学报,2010:11:1496-1500. [14] Jo Y Y, John E H, Carl L. The Web of Topics: Discovering the Topology of Evolution in a Corpus[C]//Proceedings of the WWW 2011, Hyderabad, India 2011. [15] Xianpei Han, Le Sun. A Generative Entity-Mention Model for Linking Entities with Knowledge Base[C]//Proceedings of the ACL 2011. 2011: 945-954. [16] Xianpei Han, Jun Zhao. Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation[C]//Proceedings of the 48th Annual Meeting of the Association of Computational Linguistics, 2010: 50-59. [17] Blei D, Jordan M, Ng A. Hierarchical Bayesian Models for Applications in Information Retrieval. In Bayesian Statistics, 2003,7: 25-44. [18] Antoniak C. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Annals of Statistics, 1974,2(6): 1152-1174. [19] Thomas L. G., Mark S. Finding Scientific Topics[C]//Proceedings of the National Academic of Science of United States of America, 2004. [20] Ian P, David N, Alexander I. Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation. KDD’08, Las Vegas, Nevada, USA 2008.