A Survey of Language Model Based Pre-training Technology

YUE Zengying, YE Xia, LIU Ruiheng

Journal of Chinese Information Processing ›› 2021, Vol. 35 ›› Issue (9): 15-29.
Review

Abstract

Pre-training technology has taken center stage in natural language processing, especially with the success of pre-trained models such as ELMo, GPT, BERT, XLNet, T5, and GPT-3 proposed in the last two years. In this paper, we analyze and classify existing pre-training techniques from four aspects: the language model, the feature extractor, contextual representation, and word representation. We also discuss the main open issues and development trends of pre-training technology in current natural language processing.
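
To make the survey's notion of a contextual representation concrete, the short sketch below is added for illustration only: it is not taken from the paper, and it assumes the Hugging Face transformers library, PyTorch, and the publicly released bert-base-uncased checkpoint as stand-ins for the pre-trained models the survey covers. It extracts vectors for the same word in two different sentences from a pre-trained masked language model and shows that, unlike a static embedding, the representation changes with context.

import torch
from transformers import AutoModel, AutoTokenizer

# Load a publicly released pre-trained masked language model (assumption:
# bert-base-uncased stands in for the pre-trained models discussed in the survey).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the last-layer hidden state of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (seq_len, 768)
    # Locate the position of the word's token in the input sequence.
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (inputs["input_ids"][0] == word_id).nonzero()[0, 0]
    return hidden[position]

# The same surface word "bank" receives different vectors in different contexts,
# which is what distinguishes contextual from static word representations.
v_river = word_vector("She sat on the bank of the river.", "bank")
v_money = word_vector("He deposited the money in the bank.", "bank")
print(torch.cosine_similarity(v_river, v_money, dim=0).item())

In contrast, a static word2vec or GloVe vector for "bank" would be identical in both sentences; closing that gap is precisely what the contextual pre-training approaches surveyed here are designed to do.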

Key words

natural language processing / pre-training / language model

Cite this article

YUE Zengying, YE Xia, LIU Ruiheng. A Survey of Language Model Based Pre-training Technology. Journal of Chinese Information Processing. 2021, 35(9): 15-29

Funding

Young Scientists Fund of the National Natural Science Foundation of China (62006240)