Abstract
With the explosive growth of information on the Internet, improving the efficiency of knowledge acquisition has become increasingly important. Automatic text summarization compresses and refines information, providing an effective aid for fast knowledge acquisition. Existing automatic summarization methods suffer from low accuracy on long text and fail to deliver performance that satisfies users. This paper proposes TP-AS, a two-phase automatic summarization method for long text: it first extracts key sentences using a hybrid text-similarity computation based on a graph model, and then generates the summary with a recurrent neural network encoder-decoder model that combines pointer and attention mechanisms. Experiments on a real large-scale corpus of long texts from the financial domain verify the effectiveness of TP-AS: its summarization accuracy reaches 36.6% (word-level) and 33.9% (character-level) under the ROUGE-1 metric, clearly outperforming existing methods.
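The two phases described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the hybrid similarity here is assumed to be a simple weighted mix of TextRank-style word overlap and normalized edit distance (weight `alpha` is a made-up parameter), sentence ranking uses plain weighted PageRank, and `pointer_mix` shows only the standard pointer-generator probability mixture, not the full encoder-decoder.

```python
import math
from itertools import combinations

def overlap_sim(a, b):
    # TextRank-style word-overlap similarity between two token lists
    common = len(set(a) & set(b))
    if common == 0:
        return 0.0
    return common / (math.log(len(a) + 1) + math.log(len(b) + 1))

def edit_sim(s, t):
    # Similarity from normalized Levenshtein distance (single-row DP)
    m, n = len(s), len(t)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (s[i - 1] != t[j - 1]))  # substitution
            prev = cur
    return 1.0 - dp[n] / max(m, n, 1)

def key_sentences(sentences, k=2, alpha=0.5, d=0.85, iters=50):
    # Phase 1 sketch: build a sentence graph weighted by a hybrid
    # similarity, score nodes with weighted PageRank, keep top-k.
    toks = [s.split() for s in sentences]
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim = (alpha * overlap_sim(toks[i], toks[j])
               + (1 - alpha) * edit_sim(sentences[i], sentences[j]))
        w[i][j] = w[j][i] = sim
    score = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            s = 0.0
            for j in range(n):
                if j == i or w[j][i] == 0.0:
                    continue
                denom = sum(w[j])
                if denom:
                    s += w[j][i] / denom * score[j]
            new.append((1 - d) + d * s)
        score = new
    top = sorted(range(n), key=lambda i: score[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep document order

def pointer_mix(p_gen, vocab_dist, attn, src_ids):
    # Phase 2 sketch: pointer-generator output distribution,
    # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on w.
    final = [p_gen * p for p in vocab_dist]
    for pos, wid in enumerate(src_ids):
        final[wid] += (1 - p_gen) * attn[pos]
    return final
```

Copying via `pointer_mix` is what lets the decoder emit rare source words (e.g. ticker symbols in financial text) that fall outside its generation vocabulary.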
Key words
automatic text summarization /
natural language processing /
extraction and abstraction /
RNN
Funding
National Natural Science Foundation of China (61402494, 61402498, 71690233); Natural Science Foundation of Hunan Province (2015JJ4009)