SHI Jing, HU Ming, DAI Guo-zhong. Topic Analysis of Chinese Text Based on Small World Model[J]. Journal of Chinese Information Processing, 2007, 21(3): 69-75.
Topic Analysis of Chinese Text Based on Small World Model
SHI Jing1,2,3, HU Ming3, DAI Guo-zhong1
1. Computer Human Interaction and Intelligent Information Processing Laboratory, Institute of Software, The Chinese Academy of Sciences, Beijing 100080, China; 2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China; 3. Changchun University of Technology, Changchun, Jilin 130021, China
Abstract: This paper performs topic spotting on text segments, building on a text segmentation that exploits small world structure. The main topic of the whole text is summarized, which reveals the skeleton of the text. We show that the term co-occurrence graph of a text is highly clustered and has a short path length, which indicates that texts have small world structure. Clusters in the small world structure are detected, and the density of each cluster is computed to locate segment boundaries. Topic words are extracted from the clusters of the graph. With the help of background word clustering and topic word association, words that do not appear explicitly in the analyzed text can also be included to express the topics, in an attempt to uncover the meaning behind the words. Although small world structure has been applied in many areas, analyzing texts through their small world characteristics is a new task. Experiments show that the results are far better than those of other methods. The approach provides valuable pre-processing for subsequent work on text reasoning.
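The two properties named in the abstract, high clustering and short path length, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the text is assumed to be already segmented into sentences of terms, and linking two terms whenever they share a sentence is an assumed co-occurrence window that the abstract does not specify. The clustering coefficient follows the standard Watts-Strogatz definition.

```python
from collections import deque
from itertools import combinations


def cooccurrence_graph(sentences):
    """Build an undirected graph: terms are nodes, and an edge joins two
    terms that appear in the same sentence (assumed co-occurrence window)."""
    graph = {}
    for terms in sentences:
        unique = set(terms)
        for term in unique:
            graph.setdefault(term, set())
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clustering_coefficient(graph):
    """Average local clustering coefficient over all nodes."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count edges that actually exist among this node's neighbours.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph) if graph else 0.0


def average_path_length(graph):
    """Mean shortest-path length over all connected node pairs, using BFS."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    # Toy input: each inner list is one sentence of already-segmented terms.
    toy = [["small", "world", "graph"],
           ["graph", "cluster", "topic"],
           ["topic", "word"]]
    g = cooccurrence_graph(toy)
    print(clustering_coefficient(g), average_path_length(g))
```

A graph is usually called small world when its clustering coefficient is much higher than that of a random graph with the same number of nodes and edges while its average path length stays comparably short, so in practice both values would be compared against such a random baseline.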