A Data Augmentation Method for Long Text Automatic Summarization

PI Zhou, XI Xuefeng, CUI Zhiming, ZHOU Guodong

Journal of Chinese Information Processing, 2022, Vol. 36, Issue (9): 46-56.
Language Resource Construction and Application


Abstract

Current long-text automatic summarization lacks sufficiently large datasets, which limits research on algorithms and models in this area. Data augmentation increases the amount of training data without directly collecting new samples. To address this data scarcity, this paper proposes EMDAM (Extract-Merge Data Augmentation Method), a data augmentation method for long-text automatic summarization built on the CogLTX framework. EMDAM has two core steps: extraction and merging. First, short sentences are extracted from the original long-text datasets; second, the extracted sentences are merged, in a defined order, into long texts, yielding an augmented long-text dataset that satisfies the given constraints. Compared with the baseline models, this augmentation strategy clearly improves performance on the PubMED_Min, CNN/DM_Min and news2016zh_Min datasets, and on SLCTDSets the final ROUGE score is nearly two points higher than that of the model trained without augmentation. These results show that EMDAM can expand small datasets and thereby provide data support for text summarization research.
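To make the extract-merge idea concrete, below is a minimal, illustrative Python sketch. It is not the authors' implementation: a crude word-overlap sentence scorer stands in for the CogLTX-based extraction, all helper names and parameters (extract_key_sentences, merge_into_long_text, augment, group_size, n_new) are hypothetical, and pairing the merged text with the joined source summaries is an assumption rather than the paper's definition.

"""Illustrative extract-merge style augmentation (assumptions noted above)."""
import random
import re
from collections import Counter

def split_sentences(text: str):
    """Naive splitter on English/Chinese end-of-sentence punctuation."""
    return [s.strip() for s in re.split(r"[.!?。！？]+\s*", text) if s.strip()]

def extract_key_sentences(document: str, k: int = 3):
    """'Extract' step: rank sentences by summed word-overlap similarity
    (a crude stand-in for TextRank/CogLTX) and keep the top k in document order."""
    sents = split_sentences(document)
    bags = [Counter(s.lower().split()) for s in sents]
    scores = []
    for i, bi in enumerate(bags):
        score = 0.0
        for j, bj in enumerate(bags):
            if i != j and (len(bi) + len(bj)) > 0:
                score += sum((bi & bj).values()) / (len(bi) + len(bj))
        scores.append(score)
    top = sorted(sorted(range(len(sents)), key=lambda i: scores[i], reverse=True)[:k])
    return [sents[i] for i in top]

def merge_into_long_text(extracted_groups):
    """'Merge' step: concatenate the extracted sentence groups in a fixed order
    to form one new pseudo long document."""
    return " ".join(sent for group in extracted_groups for sent in group)

def augment(dataset, group_size: int = 3, n_new: int = 100, seed: int = 0):
    """Build n_new synthetic (document, summary) pairs from an existing
    dataset of (document, summary) pairs."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        picked = rng.sample(dataset, k=min(group_size, len(dataset)))
        groups = [extract_key_sentences(doc) for doc, _ in picked]
        new_doc = merge_into_long_text(groups)
        # Assumption: pair the merged text with the joined source summaries.
        new_sum = " ".join(summary for _, summary in picked)
        augmented.append((new_doc, new_sum))
    return augmented

For example, augment(pairs, group_size=3, n_new=500) would produce 500 synthetic long-text/summary pairs from a list pairs of (document, summary) tuples; in the paper the extraction is driven by CogLTX and the merge order is explicitly defined, so this sketch only mirrors the overall shape of the pipeline.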

Key words

data augmentation / TextRank / Seq2Seq / abstractive summarization

Cite this article

PI Zhou, XI Xuefeng, CUI Zhiming, ZHOU Guodong. A Data Augmentation Method for Long Text Automatic Summarization[J]. Journal of Chinese Information Processing, 2022, 36(9): 46-56.

References

[1] Chopra S, Auli M, Rush A M. Abstractive sentence summarization with attentive recurrent neural networks[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016: 93-98.
[2] Li Z, Peng Z, Tang S, et al. Text summarization method based on double attention pointer network[J]. IEEE Access, 2020, 8: 11279-11288.
[3] Liu Y, Lapata M. Hierarchical transformers for multi-document summarization[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 5070-5081.
[4] Zhong M, Liu P, Chen Y, et al. Extractive summarization as text matching[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 6197-6208.
[5] Li W, Yan X, Xie X. Tibetan extractive summarization based on improved TextRank[J]. Journal of Chinese Information Processing, 2020, 34(09): 36-43. (in Chinese)
[6] Li H, Zhu J, Zhang J, et al. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization[C]//Proceedings of the 27th International Conference on Computational Linguistics, 2018: 1430-1441.
[7] Koncel-Kedziorski R, Bekal D, Luan Y, et al. Text generation from knowledge graphs with graph transformers[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019: 2284-2293.
[8] Zhu C, Hinthorn W, Xu R, et al. Boosting factual correctness of abstractive summarization with knowledge graph[J]. arXiv preprint arXiv: 2003.08612, 2020.
[9] Wan Y, Sun L, Zhao P, et al. Relation classification based on information-enhanced BERT[J]. Journal of Chinese Information Processing, 2021, 35(03): 69-77. (in Chinese)
[10] Zhang H, Zhang Y, Yang Z, et al. Automatic question answering for Gaokao reading comprehension based on data augmentation[J]. Journal of Chinese Information Processing, 2021, 35(09): 132-140. (in Chinese)
[11] Zeng X, Hua X, Liu P, et al. Label enhancement of text emotion distribution based on the emotion wheel and an emotion lexicon[J]. Chinese Journal of Computers, 2021, 44(06): 1080-1094. (in Chinese)
[12] Li Z, Ren S, Wang H, et al. Knowledge reasoning method based on association rules enhanced by unstructured text[J]. Computer Science, 2019, 46(11): 209-215. (in Chinese)
[13] Wei J, Zou K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019: 6383-6389.
[14] Ding M, Zhou C, Yang H, et al. CogLTX: Applying BERT to long texts[C]//Proceedings of the Advances in Neural Information Processing Systems, 2020, 33: 12792-12804.
[15] Xie Q, Dai Z, Hovy E, et al. Unsupervised data augmentation for consistency training[C]//Proceedings of the Advances in Neural Information Processing Systems, 2020, 33.
[16] Sennrich R, Haddow B, Birch A. Improving neural machine translation models with monolingual data[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016: 86-96.
[17] Zhang H, Cisse M, Dauphin Y N, et al. Mixup: Beyond empirical risk minimization[J]. arXiv preprint arXiv: 1710.09412, 2017.
[18] Schwartz E, Karlinsky L, Shtok J, et al. Delta-encoder: An effective sample synthesis method for few-shot object recognition[C]//Proceedings of the NeurIPS, 2018: 2850-2860.
[19] Kumar A, Bhattamishra S, Bhandari M, et al. Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019: 3609-3619.
[20] Kobayashi S. Contextual augmentation: Data augmentation by words with paradigmatic relations[C]//Proceedings of the NAACL-HLT, 2018: 452-457.
[21] Yang Y, Malaviya C, Fernandez J, et al. GDAug: Generative data augmentation for commonsense reasoning[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing: Findings, 2020: 1008-1025.
[22] Feng S Y, Li A W, Hoey J. Keep calm and switch on! Preserving sentiment and fluency in semantic text exchange[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019: 2701-2711.
[23] Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 9.
[24] Quteineh H, Samothrakis S, Sutcliffe R. Textual data augmentation for efficient active learning on tiny datasets[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020: 7400-7410.
[25] Fabbri A R, Han S, Li H, et al. Improving zero and few-shot abstractive summarization with intermediate fine-tuning and data augmentation[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021: 704-717.
[26] Parida S, Motlicek P. Abstract text summarization: A low resource challenge[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019: 5994-5998.
[27] Zhu H, Dong L, Wei F, et al. Transforming Wikipedia into augmented data for query-focused summarization[J]. arXiv preprint arXiv: 1911.03324, 2019.
[28] Pasunuru R, Celikyilmaz A, Galley M, et al. Data augmentation for abstractive query-focused multi-document summarization[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(15): 13666-13674.
[29] Mihalcea R, Tarau P. TextRank: Bringing order into text[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004: 404-411.
[30] Liu X, Zhang C, Chen X, et al. CLTS: A new Chinese long text summarization dataset[C]//Proceedings of the CCF International Conference on Natural Language Processing and Chinese Computing. Springer, Cham, 2020: 531-542.
[31] Kim B, Kim H, Kim G. Abstractive summarization of Reddit posts with multi-level memory networks[C]//Proceedings of NAACL-HLT, 2019: 2519-2531.
[32] Narayan S, Cohen S B, Lapata M. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018: 1797-1807.
[33] Hermann K M, Kocisky T, Grefenstette E, et al. Teaching machines to read and comprehend[J]. Advances in Neural Information Processing Systems, 2015, 28: 1693-1701.
[34] Koupaee M, Wang W Y. WikiHow: A large scale text summarization dataset[J]. arXiv preprint arXiv: 1810.09305, 2018.
[35] Cohan A, Dernoncourt F, Kim D S, et al. A discourse-aware attention model for abstractive summarization of long documents[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018: 615-621.
[36] Van der Maaten L. Accelerating t-SNE using tree-based algorithms[J]. The Journal of Machine Learning Research, 2014, 15(1): 3221-3245.
[37] Lin C Y. ROUGE: A package for automatic evaluation of summaries[C]//Proceedings of the Text Summarization Branches Out, 2004: 74-81.

Funding

National Natural Science Foundation of China (61876217, 62176175); "Six Talent Peaks" High-level Talent Project of Jiangsu Province (XYDXX-086); Suzhou Science and Technology Planning Project (SGC2021078)