Abstract
Pre-trained language models built on large-scale unsupervised corpora, such as BERT and XLNet, are usually trained with a language modeling objective based on the cross-entropy loss, while they are evaluated by perplexity or by their performance on downstream natural language processing tasks, so the training loss and the evaluation metric do not match. To address this problem, this paper proposes RL-XLNet (Reinforcement Learning-XLNet), an adversarially pre-trained language model that combines Generative Adversarial Networks (GAN) and Reinforcement Learning (RL). RL-XLNet is trained adversarially: a generator predicts selected words from their context, and a discriminator judges whether the words predicted by the generator are correct. Through the mutual reinforcement of the generator and the discriminator, the generator's understanding of semantics is strengthened and the learning ability of the model is improved. Because text generation involves a sampling step, the final loss cannot be back-propagated to the generator directly, so the generator is trained with reinforcement learning. Experiments on the General Language Understanding Evaluation benchmark (GLUE) and the Stanford Question Answering Dataset (SQuAD 1.1) show that, compared with the existing BERT and XLNet methods, RL-XLNet performs clearly better on multiple tasks: it ranks first on six GLUE tasks, second on one task, and third on one task, and it ranks first in F1 score on SQuAD 1.1. Even with limited computational resources, the model trained on a small corpus still reaches a level comparable to the state of the art in the field.
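To make the training idea in the abstract concrete, the following is a minimal sketch of one generator/discriminator update step: the generator fills in a selected (masked) position, the discriminator scores whether the filled-in token looks correct, and because the token is sampled (a non-differentiable step), the generator is updated with a REINFORCE-style policy gradient that uses the discriminator's score as the reward. This is an illustration only, written in PyTorch; the module names ToyGenerator and ToyDiscriminator, the GRU encoders, the single masked position per sentence, and all sizes are assumptions made for brevity. The paper's actual RL-XLNet builds both networks on XLNet-style Transformers and has its own reward and optimization details, which are not reproduced here.

# Minimal sketch (assumptions: toy GRU models, one masked position per sentence,
# discriminator score used directly as the REINFORCE reward). Not the paper's
# actual RL-XLNet implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, SEQ_LEN, BATCH = 1000, 64, 16, 8

class ToyGenerator(nn.Module):
    """Predicts a distribution over the vocabulary for the masked position."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.encoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens, pos):
        h, _ = self.encoder(self.embed(tokens))           # (B, L, H)
        h_pos = h[torch.arange(tokens.size(0)), pos]      # hidden state at the masked position
        return F.log_softmax(self.head(h_pos), dim=-1)    # log p(token | context)

class ToyDiscriminator(nn.Module):
    """Scores how plausible the token at the masked position is."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.encoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, tokens, pos):
        h, _ = self.encoder(self.embed(tokens))
        h_pos = h[torch.arange(tokens.size(0)), pos]
        return torch.sigmoid(self.head(h_pos)).squeeze(-1)  # P(token is correct)

gen, disc = ToyGenerator(), ToyDiscriminator()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

# One toy batch: random "sentences" with one selected position each.
real = torch.randint(1, VOCAB, (BATCH, SEQ_LEN))
pos = torch.randint(0, SEQ_LEN, (BATCH,))
masked = real.clone()
masked[torch.arange(BATCH), pos] = 0                      # token id 0 plays the role of [MASK]

# Generator step: sampling blocks backprop, so use a policy gradient (REINFORCE).
log_probs = gen(masked, pos)                               # (B, VOCAB)
sampled = torch.multinomial(log_probs.exp(), 1).squeeze(-1)  # sampled fill-in token
filled = masked.clone()
filled[torch.arange(BATCH), pos] = sampled
with torch.no_grad():
    reward = disc(filled, pos)                             # discriminator score as reward
chosen_logp = log_probs[torch.arange(BATCH), sampled]
g_loss = -(reward * chosen_logp).mean()                    # REINFORCE objective
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Discriminator step: real token vs. generated token, binary cross-entropy.
d_real = disc(real, pos)
d_fake = disc(filled.detach(), pos)
d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
         F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()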
Key words
natural language processing /
pre-training /
language model /
reinforcement learning
Funding
National Natural Science Foundation of China (U1703261); National Social Science Foundation of China (20BTQ066)