Document Summarization Based on Penetrability of Key Words
REN Liyuan1,2, XIE Zhenping1,2, LIU Yuan1,2
1. School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China;
2. Jiangsu Key Laboratory of Media Design and Software Technology, Wuxi, Jiangsu 214122, China
Abstract: Automatic document summarization aims to extract brief, important information from large volumes of text. To explore novel features for text summarization, a knowledge network is introduced to model document information. Specifically, the key words of a document are viewed as network nodes, and each sentence is represented as a path of sequential key words over the knowledge network. A feature model based on the penetrability of key words is then proposed, in which the width and depth of penetrability are defined to score each sentence. A maximum entropy based document summarization model is implemented with the proposed features, and its effectiveness is validated experimentally.
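The abstract describes the pipeline only at a high level, so the following is a minimal Python sketch of one plausible reading: key words become network nodes, consecutive key words within a sentence become edges, and each sentence is scored by proxy penetrability measures. Treating "width" as the number of direct neighbors and "depth" as the two-hop reach, as well as the weighting parameter alpha and all function names, are illustrative assumptions, not the authors' exact definitions; in the paper, such features feed a maximum entropy summarization model.

```python
from collections import defaultdict

# Hypothetical sketch of the key-word network and penetrability features.
# Assumptions (not from the paper): width(k) = number of distinct
# neighbors of key word k; depth(k) = size of k's two-hop neighborhood.

def build_network(sentences):
    """sentences: list of key-word lists; edges link consecutive key words."""
    adj = defaultdict(set)
    for kws in sentences:
        for a, b in zip(kws, kws[1:]):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def width(adj, k):
    # Direct reach of k on the network (illustrative proxy).
    return len(adj[k])

def depth(adj, k):
    # Indirect (two-hop) reach of k on the network (illustrative proxy).
    two_hop = set(adj[k])
    for n in adj[k]:
        two_hop |= adj[n]
    two_hop.discard(k)
    return len(two_hop)

def sentence_score(adj, kws, alpha=0.5):
    """Combine average width and depth over the sentence's key-word path.
    These scores would serve as sentence features for a maximum entropy
    (i.e., multinomial logistic regression) extraction model."""
    if not kws:
        return 0.0
    w = sum(width(adj, k) for k in kws) / len(kws)
    d = sum(depth(adj, k) for k in kws) / len(kws)
    return alpha * w + (1 - alpha) * d

# Toy usage on three key-word sequences.
sents = [["entropy", "model", "summarization"],
         ["keyword", "network", "model"],
         ["network", "path", "sentence"]]
net = build_network(sents)
for s in sents:
    print(s, round(sentence_score(net, s), 2))
```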