Abstract
The large gap between the word-embedding semantic spaces of Chinese minority languages and Chinese leads to poor cross-lingual transfer in pre-trained language models. This paper proposes a new framework that aligns static word embeddings into the contextual word-embedding space to improve the downstream performance of minority-language cross-lingual pre-trained models. We first cross-lingually align static word embeddings trained on large-scale monolingual data. We then extract contextual word embeddings from CINO (a Chinese minority pre-trained language model) using minority-Chinese parallel corpora, and design two loss functions, a bilingual lexicon induction loss and a contrastive learning loss, to align the static embeddings into the shared semantic space of the contextual embeddings. Finally, we apply the CINO model enhanced with this joint static-contextual cross-lingual alignment to bilingual lexicon induction, text classification, and machine translation. Experiments on multiple language pairs show that the proposed approach achieves significant improvements over robust baseline systems on downstream tasks with scarce annotated data.
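The two alignment ingredients named above have standard closed forms: the orthogonal mapping of static embeddings (the Procrustes solution of ref. [24]) and an InfoNCE-style contrastive objective (ref. [22]). A minimal NumPy sketch, where function names, batch shapes, and the temperature value are illustrative assumptions rather than the paper's actual implementation:

```python
import numpy as np

def procrustes_align(X, Y):
    """Solve min_W ||XW - Y||_F over orthogonal W (Schonemann, 1966).

    X, Y: (n, d) matrices of paired source/target word vectors
    (e.g. from a seed bilingual dictionary). Returns the mapping W.
    """
    # Closed form: if U S V^T = svd(X^T Y), then W = U V^T.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def info_nce_loss(S, C, tau=0.1):
    """InfoNCE-style contrastive loss between mapped static embeddings S
    and contextual embeddings C. Row i of S and row i of C are a positive
    pair; the other rows in the batch serve as in-batch negatives.
    """
    # Cosine similarities between every static/contextual pair.
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    logits = (Sn @ Cn.T) / tau
    # Cross-entropy with the diagonal (the true pair) as the label.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

The loss is minimized when each static embedding is closest to its own contextual counterpart, which is what pulls the two spaces together; mismatched pairs drive the loss up.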
Key words
embedding alignment /
minority pre-trained language model /
bilingual lexicon induction /
contrastive learning
References
[1] QIU X, SUN T, XU Y, et al. Pre-trained models for natural language processing: A survey[J]. arXiv preprint arXiv: 2003.08271, 2020.
[2] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[C]//Proceedings of the International Conference on Learning Representations, 2013: 1-12.
[3] PENNINGTON J, SOCHER R, MANNING C D. Glove: Global vectors for word representation[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014: 1532-1543.
[4] PETERS M, NEUMANN M, IYYER M, et al. Deep contextualized word representations[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2018: 2227-2237.
[5] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training [EB/OL]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf. [2023-04-07].
[6] DEVLIN J, CHANG M, LEE K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019: 4171-4186.
[7] CONNEAU A, KHANDELWAL K, GOYAL N, et al. Unsupervised cross-lingual representation learning at scale[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 8440-8451.
[8] MIKOLOV T, LE Q V, SUTSKEVER I. Exploiting similarities among languages for machine translation[J]. arXiv preprint arXiv: 1309.4168, 2013.
[9] VULIĆ I, RUDER S, SØGAARD A. Are all good word vector spaces isomorphic?[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020: 3178-3192.
[10] YANG Z, XU Z, CUI Y, et al. CINO: A Chinese minority pre-trained language model[C]//Proceedings of the 29th International Conference on Computational Linguistics, 2022: 3937-3949.
[11] ARTETXE M, LABAKA G, AGIRRE E, et al. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance[C]// Proceedings of the Conference on Empirical Methods in Natural Language Processing,2016: 2289-2294.
[12] MOHIUDDIN T, BARI M S, JOTY S. LNMap: Departures from isomorphic assumption in bilingual lexicon induction through non-linear mapping in latent space[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020: 2712-2723.
[13] CONNEAU A, LAMPLE G, RANZATO M, et al. Word translation without parallel data[C]//Proceedings of the International Conference on Learning Representations, 2018: 1-14.
[14] ARTETXE M, LABAKA G, AGIRRE E. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018: 5012-5019.
[15] ALDARMAKI H, DIAB M. Context-aware cross-lingual mapping[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019: 3906-3911.
[16] NAGATA M, CHOUSA K, NISHINO M. A supervised word alignment method based on cross-language span prediction using multilingual BERT[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020: 555-565.
[17] GRITTA M, IACOBACCI I. XeroAlign: Zero-shot cross-lingual transformer alignment[C]//Proceedings of the Association for Computational Linguistics, 2021: 371-381.
[18] ZHANG J, JI B, XIAO N, et al. Combining static word embeddings and contextual representations for bilingual lexicon induction[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2021: 2943-2955.
[19] HÄMMERL K, LIBOVICKÝ J, FRASER A. Combining static and contextualised multilingual embeddings[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2022: 2316-2329.
[20] LAI W. Research on key technologies of neural machine translation for low-resource languages[D]. Beijing: Minzu University of China, 2020. (in Chinese)
[21] CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[C]//Proceedings of the International Conference on Machine Learning, 2020: 1597-1607.
[22] OORD A, LI Y, VINYALS O. Representation learning with contrastive predictive coding[J]. arXiv preprint arXiv: 1807.03748, 2018.
[23] GUPTA P, JAGGI M. Obtaining better static word embeddings using contextual embedding models[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021: 5241-5253.
[24] SCHÖNEMANN P H. A generalized solution of the orthogonal procrustes problem[J]. Psychometrika, 1966, 31(1): 1-10.
[25] JOULIN A, BOJANOWSKI P, MIKOLOV T, et al. Loss in translation: Learning bilingual word mapping with a retrieval criterion[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018: 2979-2984.
[26] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the International Conference on Neural Information Processing Systems, 2017: 6000-6010.
[27] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2002: 311-318.
Funding
National Social Science Fund of China (22&ZD035)