Hybrid Summarization Method for Long Judicial Public Opinion Texts

XI Tiejun, DUAN Zongtao, CAO Jianrong, YANG Bo, BU Na’na, LIU Yuexia, XIAO Yuanyuan

Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (7): 63-72.
Information Extraction and Text Mining



Abstract

Judicial public opinion summarization aims to generate accurate, concise summaries from lengthy and complex public opinion texts. On long judicial public opinion texts, existing automatic summarization methods suffer from incoherent semantics and loss of key information. To address these problems, this paper proposes a hybrid summarization method that combines extractive and abstractive approaches. First, the long text is split into semantic segments. Next, the RoBERTa-wwm-ext model is fine-tuned with unsupervised contrastive learning to produce representations of these segments. A dilated gated convolutional neural network then extracts the segments relevant to the summary and concatenates them into an extractive text. Finally, the pre-trained language model PEGASUS is fine-tuned to generate the optimal summary from the extractive text. Experimental results on the CAIL 2022 judicial public opinion summarization dataset show that, compared with other baseline models, the proposed method generates summaries with higher ROUGE and BLEU scores, further improving summary reliability.
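The abstract describes a four-stage pipeline: semantic segmentation, SimCSE-style contrastive fine-tuning of RoBERTa-wwm-ext for segment representation, extraction with a dilated gated convolutional network, and abstractive generation with PEGASUS. The sketch below illustrates only the extractive half, and only under stated assumptions: the gated-residual block structure, the dilation schedule, the use of the [CLS] vector as the segment embedding, and the 0.5 selection threshold are illustrative choices rather than the paper's reported configuration, and the extractor weights would come from supervised training.

```python
# Sketch of the extractive stage: embed segments with RoBERTa-wwm-ext,
# score them with stacked dilated gated convolutions, and keep the
# high-scoring segments as the extractive text. Hyperparameters here
# are assumptions, not the paper's reported settings.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class DilatedGatedConv(nn.Module):
    """Residual dilated gated convolution over the sequence of segments."""

    def __init__(self, dim: int, dilation: int):
        super().__init__()
        # A single conv yields both the content and the gate channels.
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_segments, dim); Conv1d expects (batch, dim, n_segments).
        h, g = self.conv(x.transpose(1, 2)).transpose(1, 2).chunk(2, dim=-1)
        gate = torch.sigmoid(g)
        return x * (1 - gate) + h * gate  # gated residual update


class SegmentExtractor(nn.Module):
    """Scores each semantic segment for inclusion in the extractive text."""

    def __init__(self, dim: int = 768, dilations=(1, 2, 4)):
        super().__init__()
        self.blocks = nn.ModuleList(DilatedGatedConv(dim, d) for d in dilations)
        self.scorer = nn.Linear(dim, 1)

    def forward(self, seg_embeds: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            seg_embeds = block(seg_embeds)
        return torch.sigmoid(self.scorer(seg_embeds)).squeeze(-1)


@torch.no_grad()
def embed_segments(segments, tokenizer, encoder):
    """One [CLS] vector per segment. The paper first fine-tunes the encoder
    with unsupervised contrastive learning (SimCSE-style dropout noise);
    this sketch uses the off-the-shelf checkpoint."""
    batch = tokenizer(segments, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]


# Usage sketch; a trained SegmentExtractor is assumed.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
segments = ["...semantic segment 1...", "...segment 2...", "...segment 3..."]
embeds = embed_segments(segments, tokenizer, encoder).unsqueeze(0)
scores = SegmentExtractor()(embeds)[0]
extractive_text = "".join(s for s, p in zip(segments, scores) if p > 0.5)
# The extractive text is then summarized by a PEGASUS model fine-tuned on
# the CAIL 2022 data (generation step omitted here).
```

The extract-then-abstract design keeps the generator's input short, which is how the hybrid method sidesteps the input-length limits of pre-trained generators such as PEGASUS on long documents.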


Key words

judicial public opinion summarization / hybrid summarization / pre-trained language model

Cite this article

XI Tiejun, DUAN Zongtao, CAO Jianrong, YANG Bo, BU Na’na, LIU Yuexia, XIAO Yuanyuan. Hybrid Summarization Method for Long Judicial Public Opinion Texts. Journal of Chinese Information Processing. 2024, 38(7): 63-72


Funding

Key Research and Development Program of Shaanxi Province (2019ZDLGY17-08); Shaanxi Province Special Support Program for Science and Technology Innovation Leading Talents (TZ0336)