Abstract
Query-focused multi-document summarization faces two difficulties. First, keeping the summary closely relevant to the query tends to make its content repetitive and insufficiently comprehensive. Second, the original query rarely describes the user's intention completely and therefore needs to be expanded, yet existing query expansion methods mostly depend on external semantic resources. To address these problems, this paper proposes a query-focused multi-document summarization approach that uses topic analysis to identify the subtopics under the current topic. Summary sentences are selected by jointly considering two factors: the relevance to the query of the subtopic containing the sentence, and the importance of that subtopic. The query is then expanded according to the co-occurrence of words among subtopics, without using any external knowledge. Experimental results on the DUC 2006 evaluation corpus show that the system achieves higher ROUGE scores than the baseline system, and that the subtopic-based query expansion method further improves summary quality.
Key words: query-focused; multi-document summarization; subtopic; relevancy; query expansion
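The abstract names two mechanisms without giving formulas: scoring subtopics by combining query relevance with subtopic importance, and expanding the query from word co-occurrence among subtopics. The sketch below is one possible reading, not the paper's actual method. It assumes that subtopics are already available as lists of tokenized sentences, that relevance is the cosine similarity between query and subtopic term counts, that importance is the subtopic's share of all sentences, and that the two are mixed linearly. The function names and the `weight` and `top_k` parameters are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_query(query, subtopics, top_k=2):
    """Assumed expansion rule: add the top_k words that share the most
    subtopics with at least one query term. No external resource is used."""
    shared = Counter()
    for sentences in subtopics:
        vocab = {w for s in sentences for w in s}
        if vocab & query:
            shared.update(vocab - query)
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
    return query | {w for w, _ in ranked[:top_k]}


def score_subtopics(query, subtopics, weight=0.5):
    """Assumed scoring rule: weight * relevance + (1 - weight) * importance."""
    total = sum(len(sentences) for sentences in subtopics)
    q = Counter(query)
    scores = []
    for sentences in subtopics:
        tf = Counter(w for s in sentences for w in s)
        relevance = cosine(q, tf)
        importance = len(sentences) / total if total else 0.0
        scores.append(weight * relevance + (1 - weight) * importance)
    return scores
```

Under these assumptions, summary sentences would be drawn from the highest-scoring subtopics first, which limits repetition because each subtopic contributes distinct content. The expanded query can be passed back into `score_subtopics` in place of the original one.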
Funding
Supported by the University Scientific Research Program of the Education Department of Liaoning Province (L2010422)