Abstract: This paper presents a novel method that enhances graph-based multi-document summarization by incorporating Wikipedia entities. The Wikipedia content of high-frequency entities is extracted and arranged as the document collection's background knowledge. The PageRank algorithm then ranks the sentences in the document collection, and an improved DivRank algorithm ranks the sentences in both the document collection and the background knowledge. Finally, the summary sentences are chosen based on a linear combination of these two ranking results. Experiments on the Document Understanding Conference (DUC) 2005 data show that the proposed method effectively uses Wikipedia knowledge to improve summary quality.
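The ranking-and-combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pagerank`, `combine_rankings`, `select_summary`), the damping factor, the combination weight, and the toy similarity matrix are all assumptions made for illustration. The improved DivRank algorithm is not reproduced here, so its scores are represented by placeholder values.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over a weighted sentence-similarity matrix."""
    n = len(weights)
    # Row-normalise the weights into transition probabilities; a sentence
    # with no neighbours jumps uniformly to any sentence.
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total for w in row] if total > 0 else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        updated = [
            (1 - damping) / n
            + damping * sum(scores[u] * transition[u][v] for u in range(n))
            for v in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if delta < tol:
            break
    return scores


def combine_rankings(first, second, weight=0.5):
    """Linear combination of two score lists, each normalised to sum to 1."""
    total_first, total_second = sum(first), sum(second)
    return [
        weight * a / total_first + (1 - weight) * b / total_second
        for a, b in zip(first, second)
    ]


def select_summary(scores, k):
    """Indices of the k highest-scoring sentences, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy example: sentence 0 is similar to sentences 1 and 2 (assumed values).
similarity = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
centrality = pagerank(similarity)
# Placeholder scores standing in for the output of a DivRank-style ranker.
diversity = [0.2, 0.5, 0.3]
combined = combine_rankings(centrality, diversity, weight=0.7)
summary_indices = select_summary(combined, 2)
```

In this sketch the `weight` parameter controls how much the PageRank scores dominate the final ordering. How the two rankings are actually weighted in the paper is not stated in the abstract.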