Abstract
Deep Neural Networks (DNNs) perform well on a wide range of natural language processing tasks, but they are vulnerable to carefully crafted adversarial examples, which degrade model performance. Existing adversarial defenses focus on improving model robustness during the training phase and overlook defending against adversarial attacks at inference time. To address this problem, this paper proposes a defense method named Word Frequency detection Mask Recover (WFMR), which improves model robustness in two steps by combining word-frequency anomaly detection (WF) with mask recovery (MR). WF checks the frequency of each word in a sentence and treats low-frequency words as anomalous; MR masks the anomalous words so that the model recovers a sentence close to the original. Comprehensive experiments on three text classification datasets against four attack methods achieve a strong defense effect and verify the effectiveness of the method.
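The following is a minimal sketch of the two-step pipeline described above, assuming a Hugging Face `transformers` fill-mask model as the recovery component; the frequency table, threshold, and function names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the WFMR idea from the abstract: flag low-frequency words as
# suspicious (WF), mask them, and let a masked language model restore
# plausible originals (MR) before downstream classification.
from collections import Counter

from transformers import pipeline

# Hypothetical frequency table; in practice it would be built from a large
# reference corpus, e.g., the task's training set.
word_freq = Counter({"the": 50000, "movie": 1200, "was": 30000, "great": 900})
FREQ_THRESHOLD = 10  # assumed cutoff: rarer words are treated as anomalous

# Masked language model used to fill in masked positions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def wf_detect(tokens):
    # WF step: indices of low-frequency words; unseen words count as 0.
    return [i for i, tok in enumerate(tokens)
            if word_freq[tok.lower()] < FREQ_THRESHOLD]

def mr_recover(tokens, suspicious):
    # MR step: mask each suspicious word and substitute the MLM's top
    # prediction, pulling the input back toward the clean original.
    recovered = list(tokens)
    for i in suspicious:
        masked = list(recovered)
        masked[i] = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
        best = fill_mask(" ".join(masked), top_k=1)[0]
        recovered[i] = best["token_str"].strip()
    return recovered

tokens = "the movie was grtae".split()  # "grtae": a character-level perturbation
print(mr_recover(tokens, wf_detect(tokens)))  # e.g., ['the', 'movie', 'was', 'good']
```

The recovered sentence would then be passed to the protected classifier, so the defense operates purely at inference time without retraining the model.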
Key words
natural language processing /
adversarial defense /
word frequency detection /
mask
Funding
CCF-Zhipu Large Model Fund (CCF-Zhipi202312)