Phrase Substitution Based Pseudo-parallel Sentence Pair Generation Between Chinese and Vietnamese
JIA Chengxun1,2, LAI Hua1,2, YU Zhengtao1,2, WEN Yonghua1,2, YU Zhiqiang1,2
1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; 2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
Abstract: Pseudo-parallel sentence pairs, which can be generated by word-level replacement in existing small-scale bilingual data, are expected to alleviate data scarcity for low-resource language pairs such as Chinese-Vietnamese. Because translation between Chinese and Vietnamese frequently involves multi-word units, a phrase-substitution-based method is proposed for generating Chinese-Vietnamese pseudo-parallel sentence pairs. A phrase alignment table is extracted from the small-scale bilingual data and then expanded with entity phrases collected from Wikipedia. Phrases in the Chinese and Vietnamese sentences of the bilingual corpus are then identified, matched to the most similar phrases in the phrase table, and replaced by them. Experiments confirm that the pseudo-parallel sentence pairs expanded at the phrase level improve the performance of Chinese-Vietnamese neural machine translation.
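The substitution step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy phrase table, the two-dimensional phrase vectors, and the choice of cosine similarity over Chinese-side vectors are all assumptions made for illustration, since the abstract does not specify how similarity is measured.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar_pair(zh_phrase, phrase_table, zh_vectors):
    """Return the (zh, vi) entry whose Chinese side is closest to zh_phrase.

    Assumption: similarity is computed on Chinese-side phrase vectors only.
    """
    query = zh_vectors[zh_phrase]
    best, best_score = None, -1.0
    for zh, vi in phrase_table:
        if zh == zh_phrase or zh not in zh_vectors:
            continue  # skip the phrase itself and phrases without a vector
        score = cosine(query, zh_vectors[zh])
        if score > best_score:
            best, best_score = (zh, vi), score
    return best

def generate_pseudo_pairs(zh_sent, vi_sent, phrase_table, zh_vectors):
    """Replace each aligned phrase found on both sides with its nearest neighbour."""
    pseudo = []
    for zh, vi in phrase_table:
        if zh in zh_sent and vi in vi_sent and zh in zh_vectors:
            substitute = most_similar_pair(zh, phrase_table, zh_vectors)
            if substitute is None:
                continue
            new_zh = zh_sent.replace(zh, substitute[0], 1)
            new_vi = vi_sent.replace(vi, substitute[1], 1)
            pseudo.append((new_zh, new_vi))
    return pseudo

# Hypothetical toy data: an aligned phrase table and made-up phrase vectors.
phrase_table = [("北京", "Bắc Kinh"), ("上海", "Thượng Hải"), ("苹果", "quả táo")]
zh_vectors = {"北京": [1.0, 0.1], "上海": [0.9, 0.2], "苹果": [0.0, 1.0]}

pairs = generate_pseudo_pairs("我 住 在 北京", "Tôi sống ở Bắc Kinh",
                              phrase_table, zh_vectors)
print(pairs)
```

Replacing both sides with the same table entry keeps the new sentence pair aligned, which is the property that makes the generated data usable as pseudo-parallel training material.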