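This paper proposes a new method for measuring sentence similarity and applies it to automatic text summarization. Its novelty is that the similarity computation considers not only the unigrams in a sentence but also the bi-grams and tri-grams, and combines these similarity scores by regression. Experiments show that this similarity measure is effective. The paper also proposes a new extractive summarization algorithm that uses inter-sentence similarity together with sentence weights and removes redundancy while extracting sentences. Evaluation results on DUC2003 and DUC2004 (Document Understanding Conference 2003, 2004) demonstrate the effectiveness of the method. Our system ranked second in the DUC2004 evaluation.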
Abstract
This paper introduces a new method for calculating the similarity between sentences. The method uses not only unigrams but also bi-grams and tri-grams, and combines the resulting similarity scores by regression. Experiments show that the method is effective: the final summaries are better than those produced without it. We also propose a new extractive summarization algorithm based on sentence weights and the new sentence similarity measure; it reduces redundancy while extracting the most important sentences. Evaluations on DUC2003 and DUC2004 demonstrate its effectiveness. Our system ranked second among all systems participating in DUC 2004.
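The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the per-n-gram similarity function, the regression coefficients, or the redundancy criterion, so the cosine measure, the `WEIGHTS` values, the whitespace tokenizer, and the `max_sim` threshold below are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical coefficients; in the paper they would come from a regression fit.
WEIGHTS = (0.5, 0.3, 0.2)


def similarity(s1, s2, weights=WEIGHTS):
    """Weighted sum of unigram, bi-gram and tri-gram cosine similarities."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    return sum(
        w * cosine(Counter(ngrams(t1, n)), Counter(ngrams(t2, n)))
        for n, w in zip((1, 2, 3), weights)
    )


def extract(sentences, sentence_weights, k, max_sim=0.5):
    """Greedily pick up to k sentences by weight, skipping near-duplicates."""
    order = sorted(range(len(sentences)), key=lambda i: -sentence_weights[i])
    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        # Keep a sentence only if it is not too similar to any already chosen.
        if all(similarity(sentences[i], sentences[j]) < max_sim for j in chosen):
            chosen.append(i)
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `extract(sentences, sentence_weights, k)` with per-sentence importance scores returns at most `k` sentences; a candidate that overlaps heavily with an already selected sentence is skipped, which is one simple way to remove redundancy during extraction.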
Keywords
computer application /
Chinese information processing /
automatic text summarization /
vector model /
similarity calculation
Key words
computer application /
Chinese information processing /
text summarization /
vector model /
similarity calculation
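Funding
Supported by the National Natural Science Foundation of China (60103014) and a Key Research Project of the Shanghai Science and Technology Commission (035005028)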