Directional Data Augmentation for Question Paraphrase Identification

ZHU Hongyu, JIN Zhiling, HONG Yu, SU Yulan, ZHANG Min

Journal of Chinese Information Processing ›› 2022, Vol. 36 ›› Issue (9): 38-45.
Language Resource Construction and Application

Abstract

Question paraphrase identification aims to recall "homogeneous but heterogeneous" question pairs (questions that share the same meaning but differ markedly in wording) while discarding semantically unrelated noise questions, making a binary "paraphrase" versus "non-paraphrase" decision for each input question pair. Existing pre-trained language models (e.g., BERT, RoBERTa, and MacBERT) are widely used for semantic encoding of natural language and have achieved remarkable performance gains. However, these advantages are not fully realized in question paraphrase identification, because (1) pre-trained language models are insensitive to the fine-grained semantic representations required by the specific task, and (2) whether a sample is a paraphrase or not often hinges on extremely subtle semantic differences. Fine-tuning the pre-trained language model is therefore the key step in improving its task adaptability, but fine-tuning depends heavily on the quantity (diversity) and quality (reliability) of the training data. To this end, this paper proposes a directional data augmentation method (DDA) based on a generation model. DDA uses induced labels to guide a neural generation network so that diverse paraphrase and non-paraphrase augmentation samples (i.e., highly confusable heterogeneous samples) are generated automatically, enabling automatic expansion of the training data. In addition, this paper designs a multi-model ensemble label-voting mechanism to correct potential label errors in the augmented samples and thereby improve the reliability of the expanded data. Experimental results on the Chinese question paraphrase dataset LCQMC show that, compared with traditional data augmentation methods, the proposed method generates samples of higher quality and with more diverse semantic expressions.
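To make the directional generation step concrete, below is a minimal illustrative sketch (not the authors' released implementation): an induced label token is prepended to the source question, and a fine-tuned sequence-to-sequence generator is sampled to produce paraphrase or highly confusable non-paraphrase candidates. The checkpoint name, the control tokens, and the decoding settings are assumptions for illustration only.

```python
# Illustrative sketch of directional, label-guided augmentation (assumed setup,
# not the paper's released code). A control token steers a fine-tuned seq2seq
# generator toward paraphrase or confusable non-paraphrase outputs.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "path/to/finetuned-chinese-seq2seq"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_augmented(question: str, label: str, n: int = 3) -> list[str]:
    """Generate n candidate sentences steered by the induced label.

    label: "<paraphrase>" or "<non-paraphrase>", hypothetical control tokens
    added to the vocabulary when the generator was fine-tuned.
    """
    inputs = tokenizer(f"{label} {question}", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling encourages diverse surface forms
        top_p=0.9,
        max_new_tokens=64,
        num_return_sequences=n,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Each generated sentence is paired with the source question and provisionally
# labelled according to the control token (1 = paraphrase, 0 = non-paraphrase).
```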

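The ensemble label-voting correction can likewise be sketched as follows (again an assumed illustration, not the paper's exact procedure): several independently fine-tuned identification models, e.g. BERT-, RoBERTa-, and MacBERT-based classifiers, vote on each augmented pair; a clear majority overrides the provisional induced label, and ambiguous pairs are discarded. The predictor interface and the threshold are hypothetical.

```python
# Illustrative multi-model label voting for augmented pairs (assumed interface).
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

def vote_correct(
    pair: Tuple[str, str],
    predictors: Sequence[Callable[[Tuple[str, str]], int]],
    keep_threshold: float = 0.5,
) -> Optional[int]:
    """Return the majority label for an augmented pair, or None to discard it."""
    votes = Counter(p(pair) for p in predictors)
    majority_label, count = votes.most_common(1)[0]
    if count / len(predictors) <= keep_threshold:
        return None               # no reliable majority: drop the sample
    return majority_label         # may overwrite the provisional (induced) label
```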

Key words

paraphrase identification / pre-training / fine-tuning / data augmentation

Cite this article

ZHU Hongyu, JIN Zhiling, HONG Yu, SU Yulan, ZHANG Min. Directional Data Augmentation for Question Paraphrase Identification. Journal of Chinese Information Processing. 2022, 36(9): 38-45

References

[1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019: 4171-4186.
[2] Wei J, Zou K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019: 6382-6388.
[3] Yu A W, Dohan D, Luong M T, et al. QANet: Combining local convolution with global self-attention for reading comprehension[J]. arXiv preprint arXiv:1804.09541, 2018.
[4] Bao H, Dong L, Wei F, et al. UniLMv2: Pseudo-masked language models for unified language model pre-training[C]//Proceedings of the International Conference on Machine Learning, 2020: 642-652.
[5] Liu X, Chen Q, Deng C, et al. LCQMC: A large-scale Chinese question matching corpus[C]//Proceedings of the 27th International Conference on Computational Linguistics, 2018: 1952-1962.
[6] Sennrich R, Haddow B, Birch A. Improving neural machine translation models with monolingual data[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016: 86-96.
[7] Fadaee M, Bisazza A, Monz C. Data augmentation for low-resource neural machine translation[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017: 567-573.
[8] Jin D, Jin Z, Zhou J T, et al. Is BERT really robust?: A strong baseline for natural language attack on text classification and entailment[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(05): 8018-8025.
[9] Cer D, Yang Y, Kong S, et al. Universal sentence encoder[J]. arXiv preprint arXiv:1803.11175, 2018.
[10] Li L, Ma R, Guo Q, et al. BERT-ATTACK: Adversarial attack against BERT using BERT[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020: 6193-6202.
[11] Taylor W L. Cloze procedure: A new tool for measuring readability[J]. Journalism Quarterly, 1953, 30(4): 415-433.
[12] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proceedings of the Advances in Neural Information Processing Systems, 2017: 5998-6008.
[13] Liu Y, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach[J]. arXiv preprint arXiv:1907.11692, 2019.
[14] Cui Y, Che W, Liu T, et al. Revisiting pretrained models for Chinese natural language processing[J]. arXiv preprint arXiv:2004.13922, 2020.
[15] Rutherford T. Lecture notes on constant elasticity functions[C]//Proceedings of the University of Colorado, 2002.
[16] Wang H L, Hu Y X. Synonyms: A Chinese synonym toolkit[EB/OL]. https://github.com/chatopera/Synonyms[2017-09-27].
[17] Zhu H, Chen Y, Yan J, et al. DuQM: A Chinese dataset of linguistically perturbed natural questions for evaluating the robustness of question matching models[J]. arXiv preprint arXiv:2112.08609, 2021.
[18] Dror R, Baumer G, Shlomov S, et al. The hitchhiker’s guide to testing statistical significance in natural language processing[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 1383-1392.
[19] Johnson D H. The insignificance of statistical significance testing[J]. The Journal of Wildlife Management, 1999: 763-772.

Funding

Major Project of the Ministry of Science and Technology (2020YFB1313601); National Natural Science Foundation of China (62076174, 61773276)