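Pseudo-Parallel Sentence Pair Extraction for Chinese-Vietnamese Based on Semantic Adaptive Coding

GUO Junjun, TIAN Yingfei, YU Zhengtao, GAO Shengxiang, YAN Wanying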

Journal of Chinese Information Processing ›› 2021, Vol. 35 ›› Issue (9): 58-65.
Machine Translation



Abstract

Pseudo-parallel sentence pair extraction is a key task for alleviating data scarcity in low-resource Chinese-Vietnamese machine translation, and an important means of improving translation performance. Conventional extraction methods rely on semantic similarity measures, but existing deep-learning-based semantic representation methods do not account for how difficult different words are to represent. As a result, sentence semantics are captured incompletely, and the extracted sentences are of low quality and noisy. To address this problem, this paper proposes a semantic representation framework that combines a bidirectional long short-term memory (LSTM) network with semantic adaptive coding, which guides the model to use deeper computation according to the uncertainty in how difficult each word in a sentence is to represent. Specifically, Chinese and Vietnamese sentences are first encoded, and representation is carried out adaptively according to the difficulty of each word's semantic representation, so as to mine the semantic information of the different words and obtain a deep representation of both sentences. Then, at the decoder, the deep representation vectors are mapped into a unified common semantic space so as to maximize the semantic similarity between sentences, yielding higher-quality Chinese-Vietnamese pseudo-parallel sentences. Experimental results show that the proposed method improves the F1 score by 5.09% over the baseline model; in addition, using the extracted sentence pairs to train a machine translation model significantly improves translation performance.
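The final extraction and evaluation steps described in the abstract can be illustrated with a minimal sketch. It assumes sentence vectors already lie in a shared semantic space; the BiLSTM encoder and the semantic adaptive coding are not reproduced here. The function names, the toy vectors, the use of cosine similarity as the similarity measure, and the threshold value are all hypothetical choices made for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def extract_pairs(zh_vecs, vi_vecs, threshold=0.9):
    """For each Chinese sentence vector, keep its best-scoring Vietnamese
    candidate if the similarity reaches the threshold (an assumed cutoff)."""
    pairs = set()
    for i, zh in enumerate(zh_vecs):
        scores = [cosine(zh, vi) for vi in vi_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.add((i, j))
    return pairs

def f1_score(predicted, gold):
    """F1 over sets of (chinese_index, vietnamese_index) pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy vectors standing in for encoder outputs in the common space (assumed data).
zh_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
vi_vecs = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.1, 0.9, 0.0]]
gold = {(0, 0), (1, 2), (2, 1)}  # hypothetical reference alignment

predicted = extract_pairs(zh_vecs, vi_vecs, threshold=0.9)
print(sorted(predicted), round(f1_score(predicted, gold), 2))
```

In this toy setting the third Chinese vector has no candidate above the threshold, so it is left unpaired; raising or lowering the threshold trades precision against recall, which is why F1 is a natural summary metric for this task.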

Key words

data scarcity / semantic representation / adaptive encoding

Cite this article

GUO Junjun, TIAN Yingfei, YU Zhengtao, GAO Shengxiang, YAN Wanying. Pseudo-Parallel Sentence Pair Extraction for Chinese-Vietnamese Based on Semantic Adaptive Coding. Journal of Chinese Information Processing. 2021, 35(9): 58-65


Funding

National Natural Science Foundation of China (61732005, 61672271, 61761026, 61762056, 61866020); National Key Research and Development Program of China (2019QY1802)