Boost Corpus for Low-Resource Neural Machine Translation

LIU Wuying, WANG Lin

Journal of Chinese Information Processing, 2023, Vol. 37, Issue (6): 89-95.
Machine Translation

Abstract

The scarcity of bilingual sentence pair resources prevents deep-learning-based machine translation algorithms from achieving better performance on low-resource language pairs. To address the problem of language resource construction in low-resource machine translation, this paper proposes a corpus boosting strategy and designs a multi-loop framework together with a semi-supervised algorithm. The framework is a meta-framework independent of any specific machine translation algorithm. The algorithm makes full use of a moderately sized bilingual seed resource and very large-scale monolingual resources to incrementally expand the bilingual sentence pair resource, training machine translation models loop by loop to improve translation quality. Experimental results on neural machine translation for multiple languages show that the proposed corpus boosting can exploit a steady stream of monolingual resources to evolve on its own. Its effectiveness lies not only in readily achieving high-performance low-resource machine translation, but also in offering a practical option for rapidly building accurate domain-specific machine translation systems.
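The multi-loop expansion described above can be sketched as a self-training loop: translate monolingual sentences with the current model, keep only pairs the model handles confidently, retrain, and repeat. The sketch below uses a toy word-for-word "model"; `train`, `translate`, and `coverage` are hypothetical stand-ins for a real NMT pipeline, not the paper's actual implementation:

```python
def train(pairs):
    """'Train' a toy model: a source-word -> target-word dictionary."""
    model = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            model.setdefault(s, t)
    return model

def translate(model, sentence):
    """Word-for-word translation; unknown words pass through unchanged."""
    return " ".join(model.get(w, w) for w in sentence.split())

def coverage(model, sentence):
    """Fraction of source words the model knows: a crude confidence score."""
    words = sentence.split()
    return sum(w in model for w in words) / max(len(words), 1)

def corpus_boost(seed_pairs, monolingual, loops=3, threshold=1.0):
    """Incrementally expand the bilingual corpus over several loops."""
    pairs = list(seed_pairs)
    for _ in range(loops):
        model = train(pairs)
        remaining = []
        for sent in monolingual:
            if coverage(model, sent) >= threshold:
                # Confident enough: add a pseudo-pair to the corpus.
                pairs.append((sent, translate(model, sent)))
            else:
                # Defer to a later loop, when the model may know more.
                remaining.append(sent)
        monolingual = remaining
    return train(pairs), pairs
```

Each pass grows the bilingual corpus only with sentences the current model can translate confidently and then retrains, which mirrors the paper's loop-by-loop use of monolingual resources; a real system would use an NMT model and a proper confidence or filtering criterion in place of the toy coverage score.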

Key words

corpus boosting / machine translation / low-resource language / semi-supervised learning / incremental learning

Cite this article

LIU Wuying, WANG Lin. Boost Corpus for Low-Resource Neural Machine Translation. Journal of Chinese Information Processing. 2023, 37(6): 89-95

Funding

Humanities and Social Sciences Planning Fund of the Ministry of Education (20YJAZH069); Shanghai Philosophy and Social Science "13th Five-Year" Planning Project (2019BYY028); Humanities and Social Sciences Youth Fund of the Ministry of Education (20YJC740062); Guangzhou Science and Technology Plan Project (202201010061)