Abstract
Compared with plain-text translation, markup language translation faces two technical difficulties: the complexity and diversity of markup formats lower translation quality, and the markup format is hard to preserve on the target side. To address these difficulties, this paper proposes a markup language modeling method based on combined generalization. For the format restoration problem, it further proposes measuring restoration quality with tag position precision, accuracy, recall, and F1 score. Experiments show that the proposed generalization method achieves substantially higher BLEU than truncation-based, word-alignment-based, and existing generalization methods, with a format restoration rate close to 100%.
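The tag position metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so everything here is an assumption made for illustration: a tag's position is taken to be the number of whitespace-separated plain-text tokens that precede it, a hypothesis tag counts as correct only if the same tag appears at the same position in the reference, and the helper names `tag_positions` and `tag_prf` are hypothetical.

```python
import re
from collections import Counter

# Assumption: tags look like <...>; real markup may need a proper parser.
TAG_RE = re.compile(r"<[^<>]+>")


def tag_positions(text):
    """Return a multiset of (tag, number of plain-text tokens before the tag)."""
    positions = Counter()
    cursor = 0      # end of the previous tag in the raw string
    n_tokens = 0    # plain-text tokens seen so far
    for match in TAG_RE.finditer(text):
        # Assumption: whitespace tokenization; unsegmented text such as
        # Chinese would need a different tokenizer.
        n_tokens += len(text[cursor:match.start()].split())
        positions[(match.group(), n_tokens)] += 1
        cursor = match.end()
    return positions


def tag_prf(reference, hypothesis):
    """Precision, recall and F1 over (tag, position) pairs."""
    ref = tag_positions(reference)
    hyp = tag_positions(hypothesis)
    matched = sum((ref & hyp).values())  # multiset intersection
    precision = matched / sum(hyp.values()) if hyp else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    reference = "Click <b> here </b> to continue"
    hypothesis = "Click <b> here to </b> continue"
    # The closing tag is one token too late, so only <b> matches.
    print(tag_prf(reference, hypothesis))  # (0.5, 0.5, 0.5)
```

Under these assumptions a translation that restores every tag at its reference position scores 1.0 on all three values, which is the sense in which a restoration rate "close to 100%" could be read.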
Key words
markup language /
machine translation /
generalization-based modeling approach
Funding
National Natural Science Foundation of China (61876035, 61732005)