Multiple-Teacher and Multiple-Student Knowledge Distillation on Sentiment Classification

CHANG Xiaoqin, LI Yameng, LI Zicheng, LI Shoushan

Journal of Chinese Information Processing, 2024, Vol. 38, Issue 10: 127-134
Sentiment Analysis and Social Computing

Abstract

Pre-trained language models (PLMs) have achieved remarkable performance gains in sentiment classification. However, their huge parameter counts and slow inference speed are the main obstacles to deploying these models in practice. Knowledge distillation is a technique for transferring knowledge from a large pre-trained teacher model to a small student model. In contrast to existing single-teacher or single-student distillation models, this paper proposes an ensemble distillation approach based on multiple teachers and multiple students. The approach both exploits the different knowledge of different teacher models and compensates for the limited learning capacity of a single student. In addition, a large amount of unlabeled data related to the sentiment classification task is used to improve distillation performance. Experimental results show that, while largely preserving the classification performance of the teacher models on sentiment classification, the proposed approach reduces the number of parameters by 97.8%~99.5% and achieves 176 to 645 times faster CPU inference.
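
The paper describes its method only at this abstract level; for orientation, the sketch below shows what a standard multiple-teacher soft-label distillation objective looks like: each student matches, via temperature-scaled KL divergence, the averaged softened predictions of the teacher ensemble, with an optional cross-entropy term on labeled data. The function name, temperature, and weighting coefficient are illustrative assumptions and are not taken from the authors' implementation.

# A minimal, illustrative sketch of multiple-teacher soft-label distillation
# (PyTorch). All names and hyperparameters below are assumptions for
# illustration, not the paper's actual code.
import torch
import torch.nn.functional as F


def ensemble_distillation_loss(student_logits, teacher_logits_list, labels=None,
                               temperature=2.0, alpha=0.5):
    """Distill the averaged soft targets of several teachers into one student.

    student_logits:      (batch, num_classes) logits from one student model
    teacher_logits_list: list of (batch, num_classes) logits, one per teacher
    labels:              optional hard labels; None for unlabeled distillation data
    """
    # Average the teachers' temperature-softened distributions (soft targets).
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # Temperature-scaled KL divergence between the student and the ensemble.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * (temperature ** 2)

    if labels is None:
        # Unlabeled, task-related samples contribute only the distillation term.
        return kd_loss

    # Labeled samples additionally contribute the usual cross-entropy term.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss

With multiple students, each student would be trained against the same ensembled soft targets and their predictions combined (e.g., averaged) at inference time; the unlabeled data mentioned in the abstract would be passed through such a loss with labels=None.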

Key words

sentiment classification / knowledge distillation / ensemble learning / model compression

Cite this article

CHANG Xiaoqin, LI Yameng, LI Zicheng, LI Shoushan. Multiple-Teacher and Multiple-Student Knowledge Distillation on Sentiment Classification. Journal of Chinese Information Processing. 2024, 38(10): 127-134

Funding

National Natural Science Foundation of China (62076176)