Fine Tuning Methods for Large Language Models: A Survey

WU Chunzhi 1,2, ZHAO Yulong 1, LIU Xin 4, SI Nianwen 1,3, ZHANG Lufei 1, FAN Hao 1

Journal of Chinese Information Processing, 2025, 39(2): 1-26. (Survey)

Abstract

In recent years, the advent of large language models (LLMs) has revolutionized the field of artificial intelligence, particularly natural language processing (NLP). A pivotal aspect of LLM training is parameter fine-tuning, which enables users with limited resources to improve a model's ability to follow instructions and solve downstream tasks by adjusting only a small fraction of its parameters. This paper offers an exhaustive review of more than 50 primary fine-tuning approaches for LLMs proposed between 2019 and 2024, categorizing them into four distinct types: full-parameter tuning, partial-parameter tuning, additional-parameter tuning, and parameter-free tuning. Each approach is discussed in terms of its underlying principles, the parameters it adjusts and where they sit in the model, and its distinctive characteristics. From a computational perspective, the paper then compares the parameter volume, memory usage, and computational demands of the different classes of methods. Drawing on this survey and related fine-tuning practice, it concludes with recommendations on fine-tuning strategies for LLMs, with the aim of advancing the field.
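
To make the taxonomy above concrete, the following minimal sketch (an illustration written for this page, not code from the paper; the hidden size, the rank r, and the LoRALinear wrapper are hypothetical choices) contrasts full-parameter tuning of a single linear layer with a LoRA-style low-rank update added to the same layer while its pretrained weights stay frozen. Counting the trainable parameters of the two setups is the kind of parameter-volume comparison the survey reports, assuming a standard PyTorch environment.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank (LoRA-style) update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weight and bias
            p.requires_grad = False
        # trainable low-rank factors: delta_W = B @ A, scaled by alpha / r
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus low-rank correction
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

def trainable_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

d_model = 4096                                    # hypothetical hidden size
full = nn.Linear(d_model, d_model)                # full-parameter tuning: everything trainable
lora = LoRALinear(nn.Linear(d_model, d_model), r=8)

print("full fine-tuning:", trainable_params(full))   # 16,781,312 trainable parameters
print("LoRA, r=8:", trainable_params(lora))          # 65,536 trainable parameters
print("output shape:", tuple(lora(torch.randn(2, d_model)).shape))

Because Adam-style optimizer states scale with the number of trainable parameters, the memory gap between full-parameter and additional-parameter methods is typically even larger than the raw parameter ratio above suggests, which is one reason the survey compares memory consumption separately from parameter count.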

Keywords

artificial intelligence / large language model (LLM) / fine-tuning / Adapter / LoRA

Cite this article

WU Chunzhi, ZHAO Yulong, LIU Xin, SI Nianwen, ZHANG Lufei, FAN Hao. Fine Tuning Methods for Large Language Models: A Survey. Journal of Chinese Information Processing. 2025, 39(2): 1-26


WU Chunzhi (b. 1991), co-first author, Ph.D., lecturer; main research interests: space launch equipment support and deep learning algorithms. E-mail: chunzhi.wu@qq.com
ZHAO Yulong (b. 1984), co-first author, Ph.D. candidate, assistant researcher; main research interest: foundational software for artificial intelligence. E-mail: zhaoyl04@163.com
LIU Xin (b. 1979), corresponding author, Ph.D., researcher; main research interests: parallel algorithms and applications. E-mail: yyylx@263.net

Funding

National Key Research and Development Program of China (2018ZX01028102)