Siamese XLM-R Based Bilingual Parallel Corpus Filtering Method for Machine Translation

TU Jie, LI Maoxi, QIU Bailian

Journal of Chinese Information Processing ›› 2025, Vol. 39 ›› Issue (2): 63-71.
Machine Translation


Abstract

In machine translation, the quantity and quality of the bilingual parallel corpora used for model training greatly affect system performance; however, many bilingual parallel corpora are currently extracted from bilingual comparable corpora by automatic filtering methods. To improve the performance of automatic parallel corpus filtering, this paper proposes a filtering method based on a Siamese XLM-R model. A Siamese neural network built on the cross-lingual pre-trained language model XLM-R maps source-language and target-language sentences into a deep semantic space; mean pooling then yields sentence representations of the same dimension, and sentence pairs with high similarity are extracted according to the cosine distance between their representations. Experimental results on the WMT18 parallel corpus filtering task show that the proposed model outperforms the compared baselines and is competitive with the systems that participated in the evaluation campaign.
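The pipeline described in the abstract — encode both sides of each candidate pair, mean-pool the token vectors into fixed-size sentence embeddings, then keep pairs whose embeddings are close in cosine terms — can be sketched as follows. This is an illustrative reconstruction, not the authors' code: in the paper the token states come from XLM-R, whereas here they are plain arrays, and the similarity threshold is an arbitrary placeholder.

```python
import numpy as np

def mean_pool(token_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions.

    token_states: (seq_len, dim) hidden states (XLM-R output in the paper;
    any arrays here). attention_mask: (seq_len,) with 1 marking real tokens.
    """
    mask = attention_mask[:, None].astype(float)       # (seq_len, 1)
    return (token_states * mask).sum(axis=0) / mask.sum()

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two pooled sentence embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_pairs(pairs, threshold=0.8):
    """Keep sentence pairs whose embeddings are similar enough.

    pairs: iterable of (src_embedding, tgt_embedding, sentence_pair);
    threshold is illustrative, not a value from the paper.
    """
    kept = []
    for src_emb, tgt_emb, sentence_pair in pairs:
        if cosine_similarity(src_emb, tgt_emb) >= threshold:
            kept.append(sentence_pair)
    return kept
```

Because both branches of a Siamese network share one encoder, the same `mean_pool` is applied to source and target sentences, which is what makes their embeddings directly comparable in one space.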

Key words

machine translation / automatic filtering of bilingual parallel corpora / Siamese neural network / XLM-R model / contrastive loss
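Among the keywords, contrastive loss refers to the classic Siamese-network training objective (Chopra et al., 2005): genuine parallel pairs are pulled together in embedding space, while non-parallel pairs are pushed apart up to a margin. A minimal sketch, assuming a generic distance between the two sentence embeddings (e.g. cosine distance, consistent with the paper's use of cosine similarity) and an illustrative margin value:

```python
def contrastive_loss(dist: float, is_parallel: int, margin: float = 1.0) -> float:
    """Classic Siamese contrastive loss (Chopra et al., 2005).

    dist: distance between the two sentence embeddings (e.g. 1 - cosine
    similarity; the margin value here is illustrative, not from the paper).
    is_parallel: 1 for a genuine parallel pair, 0 for a negative pair.
    """
    if is_parallel:
        return dist ** 2                      # pull parallel pairs together
    return max(0.0, margin - dist) ** 2       # push negatives apart, up to margin
```

Negative pairs already farther apart than the margin contribute zero loss, so training effort concentrates on the hard cases near the decision boundary.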

Cite this article

TU Jie, LI Maoxi, QIU Bailian. Siamese XLM-R Based Bilingual Parallel Corpus Filtering Method for Machine Translation. Journal of Chinese Information Processing, 2025, 39(2): 63-71.


TU Jie (1998—), master's student; research interests: natural language processing and machine translation. E-mail: jietu@jxnu.edu.cn
LI Maoxi (1977—), corresponding author, Ph.D., professor; research interests: natural language processing and machine translation. E-mail: mosesli@jxnu.edu.cn
QIU Bailian (1981—), Ph.D., lecturer; research interests: computational linguistics and machine translation. E-mail: qiubl@ecjtu.edu.cn

Funding

National Natural Science Foundation of China (61662031, 62366020); Science and Technology Project of the Jiangxi Provincial Department of Education (GJJ210306); Ministry of Education Industry-University Cooperative Education Project (220604647062739); Ministry of Education Humanities and Social Sciences Project (21YJC740040)