

Journal of Chinese Information Processing ›› 2021, Vol. 35 ›› Issue (8): 16-27.


  • 黄佳跃,熊德意

A Survey of Sentence Alignment

  • HUANG Jiayue, XIONG Deyi




Neural machine translation achieves good results on language pairs with abundant parallel corpora, but it performs poorly on pairs with scarce bilingual resources, such as Chinese-Vietnamese. This problem can be alleviated by generating pseudo-parallel sentence pairs through word-level replacement in existing small-scale bilingual data. Because word-level substitution between Chinese and Vietnamese suffers from one word having multiple translations, we study replacement at a larger granularity and propose a method for generating Chinese-Vietnamese pseudo-parallel sentence pairs based on phrase substitution. Phrases are extracted from small-scale bilingual data to construct a phrase alignment table, which is then expanded with entity phrases extracted from Wikipedia. After phrase recognition is performed on the Chinese and Vietnamese sides of the bilingual data, each recognized phrase is replaced with the phrase pair from the phrase alignment table that is most similar to it, which yields phrase-level data augmentation. The final neural machine translation model is trained on the generated pseudo-parallel sentence pairs together with the original data. Experimental results on Chinese-Vietnamese translation tasks show that pseudo-parallel sentence pairs generated by phrase substitution effectively improve the performance of Chinese-Vietnamese neural machine translation.


Key words

neural machine translation / sentence alignment


黄佳跃,熊德意. 句对齐研究综述. 中文信息学报. 2021, 35(8): 16-27
HUANG Jiayue, XIONG Deyi. A Survey of Sentence Alignment. Journal of Chinese Information Processing. 2021, 35(8): 16-27










