Abstract
Currently, most Chinese pre-trained language models adopt character-level encoding, which produces long sequences and thus incurs a large computational overhead. Although word-level encoding can alleviate this issue, it introduces other problems such as out-of-vocabulary words and data sparsity. To encode Chinese at different granularities, this paper proposes a Chinese pre-trained model with mixed-grained encoding. The vocabulary used by this encoding is derived from large-scale pre-training corpora, which alleviates the out-of-vocabulary and data-sparsity issues. To further improve model performance, the paper introduces a selective masked language modeling strategy, IDF-masking, based on the inverse document frequency (IDF) of words computed over the pre-training corpora. Extensive experiments show that, compared with previous Chinese pre-trained language models, the proposed model achieves better or comparable performance on a range of Chinese natural language processing datasets while encoding text more efficiently.
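The abstract does not spell out how the IDF statistics are turned into a masking decision, so the sketch below is only an illustration under stated assumptions: documents are assumed to be already segmented with the mixed-grained vocabulary, masking is assumed to favour high-IDF (rarer, more informative) tokens, and the helpers compute_idf and idf_masking are hypothetical names, not the authors' implementation.

import math
from collections import Counter

def compute_idf(tokenized_docs):
    """Inverse document frequency of every token type in a corpus:
    idf(t) = log(N / df(t)), with N the number of documents and
    df(t) the number of documents that contain token t."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def idf_masking(tokens, idf, mask_ratio=0.15, mask_token="[MASK]"):
    """Selective MLM masking: mask the positions whose tokens have the
    highest IDF (the rarest, most informative ones) instead of sampling
    positions uniformly at random. This is a simplified, deterministic
    variant; a real pre-training pipeline would add randomness."""
    n_mask = max(1, round(len(tokens) * mask_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: idf.get(tokens[i], 0.0),
                    reverse=True)
    masked = set(ranked[:n_mask])
    inputs = [mask_token if i in masked else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in masked else None for i, tok in enumerate(tokens)]
    return inputs, labels

# Toy usage with a mixed-grained segmentation (multi-character words and
# single characters in the same sequence).
corpus = [["自然语言", "处理", "很", "有趣"],
          ["预训练", "模型", "很", "强大"]]
idf = compute_idf(corpus)
print(idf_masking(corpus[0], idf, mask_ratio=0.25))

On this toy corpus the scheme masks a high-IDF word ("自然语言") and leaves the corpus-wide frequent character ("很") untouched, which illustrates the intuition behind biasing masked language modeling toward informative tokens.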
Key words
Chinese pre-training / mixed-grained encoding / IDF-masking
Funding
National Natural Science Foundation of China (62022027)