Unsupervised Adversarial Training for Cross-lingual Text Classification

CUI Donghu, CUI Rongyi, ZHAO Yahui

Journal of Chinese Information Processing ›› 2023, Vol. 37 ›› Issue (9): 55-62.
Minority, Cross-border and Neighboring Language Information Processing

Abstract

To address the lack of sufficient annotated data in most languages, this paper proposes a Chinese-Korean cross-lingual multi-layer semantic alignment text classification model. By combining unsupervised word embedding mapping with adversarial training, the model learns text classification knowledge from a resource-rich language and transfers it to a low-resource language. First, pre-trained monolingual word vectors are mapped into a shared semantic space by a linear mapping method; adversarial training between source-language and target-language word information then guides the classification model toward language-independent features, improving Chinese-Korean cross-lingual text classification. Experimental results show that, compared with directly using Chinese-Korean cross-lingual word vectors, the proposed method significantly improves classification accuracy, reaching 84.1% under unsupervised conditions.
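The first step the abstract describes, mapping pre-trained monolingual word vectors into a shared semantic space with a linear map, is commonly solved as an orthogonal Procrustes problem. The sketch below is a minimal illustration on a synthetic toy "dictionary" of aligned vectors, not the paper's unsupervised self-learning pipeline; all data and dimensions are invented for the example:

```python
import numpy as np

def orthogonal_map(X, Y):
    """Solve min_W ||XW - Y||_F over orthogonal W (Procrustes): W = U V^T
    from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy setup: 4 aligned vector pairs in 3 dimensions.
rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 3))                    # "target-language" vectors
R = np.linalg.qr(rng.normal(size=(3, 3)))[0]   # hidden orthogonal map
X = Y @ R.T                                    # "source" vectors = mapped targets

W = orthogonal_map(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))  # True: the hidden map is recovered
```

In the unsupervised setting of the paper (cf. the MUSE and VecMap lines of work it builds on), the seed dictionary used to fit such a map is itself induced without supervision rather than given.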

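The adversarial component can likewise be sketched: a language discriminator is trained to tell source from target features, while a shared feature extractor receives the reversed gradient so that the learned features become language-indistinguishable. The numpy loop below is a minimal gradient-reversal illustration with manual gradients; the data distributions, dimensions, and learning rates are assumptions for the example, and the paper's task classifier (which prevents trivial collapse) is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_feat = 64, 8, 4
# Toy "embeddings": the two languages come from shifted distributions.
src = rng.normal(0.0, 1.0, size=(n, d_in))
tgt = rng.normal(1.0, 1.0, size=(n, d_in))
X = np.vstack([src, tgt])
y_lang = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = source, 1 = target

Wf = rng.normal(scale=0.1, size=(d_in, d_feat))  # shared feature extractor
wd = np.zeros(d_feat)                            # language discriminator

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def mean_gap(W):
    """Distance between the two languages' feature centroids."""
    F = X @ W
    return np.linalg.norm(F[:n].mean(0) - F[n:].mean(0))

gap_before = mean_gap(Wf)
lr, lam = 0.05, 1.0
for _ in range(500):
    F = X @ Wf
    p = sigmoid(F @ wd)
    grad_p = (p - y_lang) / len(y_lang)   # d(cross-entropy)/d(logit)
    wd -= lr * (F.T @ grad_p)             # discriminator: minimize language CE
    g_F = np.outer(grad_p, wd)            # dCE/dF
    Wf += lam * lr * (X.T @ g_F)          # extractor: '+=' reverses the gradient

print(gap_before, mean_gap(Wf))  # centroid gap shrinks: features grow language-invariant
```

The sign flip on the extractor update is exactly the gradient-reversal trick: the discriminator descends its loss while the feature extractor ascends it, pushing toward language-independent features.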
Key words

text classification / cross-lingual word embedding / adversarial training

Cite this article

CUI Donghu, CUI Rongyi, ZHAO Yahui. Unsupervised Adversarial Training for Cross-lingual Text Classification. Journal of Chinese Information Processing, 2023, 37(9): 55-62.


Funding

National Language Commission "13th Five-Year" Research Planning Project (YB135-76); Yanbian University World-Class Discipline Construction Project for Foreign Languages and Literatures (18YLPY13, 18YLPY14); National Social Science Fund of China (22&ZD305)