Text Polishing with Chinese Idiom: Task, Datasets and Pre-trained Baselines

LIAO Junwei, CHENG Shuai

Journal of Chinese Information Processing, 2024, Vol. 38, Issue 11: 146-159.
Natural Language Understanding and Generation

Abstract

This paper proposes the task of text polishing, whose goal is to generate a more elegantly worded sentence while preserving the original meaning of the input sentence. Text polishing is of great practical value and is an important component of modern intelligent writing assistance systems, yet it has rarely been studied in the existing literature; further work in this important direction requires a more formal task definition, benchmark datasets, and strong baseline models. This paper studies text polishing through a case study on polishing Chinese text with idioms. First, the task is formalized as a context-dependent sequence-to-sequence text generation problem. Second, to address the difficulty of annotating task data, a semi-automatic data construction method based on human-machine collaboration is proposed and used to build a large-scale Chinese text polishing dataset of 1.5 million instances. Finally, two types of task-specific pre-training objectives are proposed for the task, and a series of Transformer-based pre-trained language models are trained with these objectives as baselines. Extensive experiments with the baseline models on the constructed datasets yield several important findings and conclusions, and human evaluation further shows that the baseline models exhibit good text polishing ability.
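
To make the context-dependent sequence-to-sequence formulation concrete, the sketch below wraps an off-the-shelf Chinese BART checkpoint from the Hugging Face transformers library as a polishing model. The checkpoint name (fnlp/bart-base-chinese), the [SEP]-joined context/sentence input format, and the decoding settings are illustrative assumptions rather than the authors' released model; in practice such a model would first be pre-trained with the task-specific objectives and fine-tuned on the polishing dataset described above.

# Minimal sketch of text polishing framed as context-dependent seq2seq generation.
# The checkpoint, input format, and decoding settings are illustrative assumptions,
# not the paper's actual setup.
from transformers import BertTokenizer, BartForConditionalGeneration

MODEL_NAME = "fnlp/bart-base-chinese"  # assumed public Chinese BART checkpoint

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

def polish(sentence: str, context: str = "") -> str:
    # Join the optional surrounding context and the target sentence into one
    # source sequence, then let the seq2seq model generate the rewritten sentence.
    source = f"{context}[SEP]{sentence}" if context else sentence
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_length=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example call: a model fine-tuned on the polishing dataset would be expected to
# rewrite the plain sentence more elegantly, e.g. by inserting a suitable idiom.
print(polish("他做事情总是拖拖拉拉,到最后才匆忙完成。"))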

Key words

text polishing / intelligent writing assistance / human-machine collaborative data construction / pre-trained language model

Cite This Article

LIAO Junwei, CHENG Shuai. Text Polishing with Chinese Idiom: Task, Datasets and Pre-trained Baselines. Journal of Chinese Information Processing. 2024, 38(11): 146-159

Funding

National Natural Science Foundation of China (61976043)