Abstract: Processing a document set according to its semantic structure helps produce better multi-document summaries. In this paper, subject-object-predicate triples are first extracted from the document set to construct a document semantic graph. Edit distance-based clustering is then applied to optimize the graph structure, and the PageRank algorithm is applied to assign weights to the vertices and links. Finally, the triples whose vertices and links carry the highest weights are collected as the summary. Evaluated against extraction-based summarization using ROUGE scores on a set of manually generated summaries, the semantic graph-based summarization achieved greater overlap with the manually created summaries, and the edit distance-based graph structure optimization had a positive effect on summarization quality.
Key words: computer application; Chinese information processing; document semantic graph; edit distance; PageRank; ROUGE; Chinese multi-document summarization
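The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `(subject, predicate, object)` tuple layout, the greedy clustering rule, the `max_dist` threshold, and the damping factor of 0.85 are all assumptions made for illustration. Link weights are omitted for brevity, so each triple is scored only by the PageRank of its two vertices.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cluster_labels(labels, max_dist=1):
    """Greedy clustering (an assumed rule): map each label to the first
    earlier representative within max_dist edits, else start a new cluster."""
    reps, mapping = [], {}
    for label in labels:
        for rep in reps:
            if edit_distance(label, rep) <= max_dist:
                mapping[label] = rep
                break
        else:
            reps.append(label)
            mapping[label] = label
    return mapping


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = defaultdict(list)
    for source, target in edges:
        out[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or nodes  # dangling vertex: spread rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


def rank_triples(triples, max_dist=1):
    """Merge near-duplicate vertices, weight vertices with PageRank, and
    return (score, triple) pairs sorted from highest to lowest score."""
    labels = [x for s, _, o in triples for x in (s, o)]
    mapping = cluster_labels(labels, max_dist)
    merged = [(mapping[s], p, mapping[o]) for s, p, o in triples]
    rank = pagerank([(s, o) for s, _, o in merged])
    scored = [(rank[s] + rank[o], (s, p, o)) for s, p, o in merged]
    return sorted(scored, key=lambda item: -item[0])


if __name__ == "__main__":
    demo = [("cat", "chases", "mouse"),
            ("cats", "eat", "mouse"),
            ("dog", "chases", "cat")]
    for score, triple in rank_triples(demo):
        print(f"{score:.3f}", triple)
```

In this toy input, "cats" is merged into "cat" because the two labels differ by a single edit, which mirrors the role the abstract assigns to edit distance-based clustering. A summary would then be assembled from the top-ranked triples.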