Abstract
Currently, most Chinese pre-trained language models adopt character-level encoding, which produces long sequences and thus incurs a large computational overhead. Although word-level encoding can alleviate this issue, it introduces other problems such as out-of-vocabulary words and data sparsity. To encode Chinese at different granularities, this paper proposes a Chinese pre-trained model with mixed-grained encoding. The vocabulary used by this encoding is derived from large-scale pre-training corpora, which alleviates the out-of-vocabulary and data-sparsity issues. To further improve model performance, the paper introduces a selective masked language modeling strategy, IDF-masking, based on the inverse document frequency (IDF) of words computed over the pre-training corpora. Extensive experiments show that, compared with previous Chinese pre-trained language models, the proposed model achieves better or comparable performance on a range of Chinese natural language processing datasets while encoding text more efficiently.
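The abstract does not spell out how the IDF statistics are turned into a masking decision, so the sketch below is only an illustration under stated assumptions: documents are assumed to be already segmented with the mixed-grained vocabulary, masking is assumed to favour high-IDF (rarer, more informative) tokens, and the helpers compute_idf and idf_masking are hypothetical names, not the authors' implementation.

import math
from collections import Counter

def compute_idf(tokenized_docs):
    """Inverse document frequency of every token type in a corpus:
    idf(t) = log(N / df(t)), with N the number of documents and
    df(t) the number of documents that contain token t."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def idf_masking(tokens, idf, mask_ratio=0.15, mask_token="[MASK]"):
    """Selective MLM masking: mask the positions whose tokens have the
    highest IDF (the rarest, most informative ones) instead of sampling
    positions uniformly at random. This is a simplified, deterministic
    variant; a real pre-training pipeline would add randomness."""
    n_mask = max(1, round(len(tokens) * mask_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: idf.get(tokens[i], 0.0),
                    reverse=True)
    masked = set(ranked[:n_mask])
    inputs = [mask_token if i in masked else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in masked else None for i, tok in enumerate(tokens)]
    return inputs, labels

# Toy usage with a mixed-grained segmentation (multi-character words and
# single characters in the same sequence).
corpus = [["自然语言", "处理", "很", "有趣"],
          ["预训练", "模型", "很", "强大"]]
idf = compute_idf(corpus)
print(idf_masking(corpus[0], idf, mask_ratio=0.25))

On this toy corpus the scheme masks a high-IDF word ("自然语言") and leaves the corpus-wide frequent character ("很") untouched, which illustrates the intuition behind biasing masked language modeling toward informative tokens.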
Key words
Chinese pre-training / mixed-grained encoding / IDF-masking
Funding
National Natural Science Foundation of China (62022027)