Semi-supervised Chinese-Burmese Bilingual Dictionary Construction
(基于半监督的汉缅双语词典构建方法)

MAO Cunli1,2, LU Shan1,2, WANG Hongbin1,2, YU Zhengtao1,2, WU Xia1,2, WANG Zhenhan1,2

Journal of Chinese Information Processing (中文信息学报) ›› 2021, Vol. 35 ›› Issue (7): 47-53.
Section: Language Resource Construction


Abstract

A Chinese-Burmese bilingual dictionary is an important data resource for research on machine translation, cross-language retrieval, and related tasks. Iterative self-learning based on a small-scale seed dictionary has achieved good results in extracting bilingual dictionaries from parallel corpora. However, for the low-resource Chinese-Burmese dictionary extraction task, the scarcity of bilingual parallel resources prevents iterative self-learning from producing effective bilingual word vector representations, resulting in low accuracy of the dictionary extraction model. Studies suggest that similar words in comparable corpora often appear in similar contexts. This paper therefore proposes a semi-supervised method for constructing a Chinese-Burmese bilingual dictionary: a pre-trained language model is used to build contextual feature vectors for bilingual words, which semantically enhance the Chinese-Burmese word pairs obtained by iterative self-learning over comparable corpora and a small-scale seed dictionary. Experimental results show that the proposed method significantly outperforms the baseline method.
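The iterative self-learning baseline that the abstract refers to can be sketched as follows. This is a minimal illustration under assumed inputs, not the authors' implementation: it assumes pre-trained monolingual word embeddings for each language, learns an orthogonal mapping by solving the Procrustes problem on the current dictionary, then re-induces the dictionary by nearest-neighbour search, and alternates the two steps.

```python
import numpy as np

def unit_rows(m):
    """Length-normalize embedding rows so a dot product equals cosine similarity."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def self_learn(src_emb, tgt_emb, seed_pairs, n_iter=5):
    """Iterative self-learning for bilingual lexicon induction (a sketch in the
    spirit of Artetxe-style methods; function and variable names are illustrative).

    src_emb, tgt_emb: (vocab, dim) monolingual embedding matrices.
    seed_pairs: list of (src_index, tgt_index) seed dictionary entries.
    Returns the learned orthogonal map W and the induced dictionary.
    """
    X, Z = unit_rows(src_emb), unit_rows(tgt_emb)
    pairs = list(seed_pairs)
    for _ in range(n_iter):
        S = X[[i for i, _ in pairs]]   # source side of the current dictionary
        T = Z[[j for _, j in pairs]]   # target side of the current dictionary
        # Orthogonal Procrustes: the W minimizing ||S W - T||_F is U V^T,
        # where U S V^T is the SVD of S^T T.
        U, _, Vt = np.linalg.svd(S.T @ T)
        W = U @ Vt
        # Re-induce the dictionary: nearest target neighbour of each mapped
        # source word under cosine similarity.
        sims = (X @ W) @ Z.T
        pairs = [(i, int(sims[i].argmax())) for i in range(len(X))]
    return W, pairs
```

On a toy example where the target space is an exact rotation of the source space, a handful of seed pairs is enough for this loop to recover the full dictionary. With real comparable corpora the two spaces are only approximately isomorphic, which is precisely the weakness that the paper's contextual semantic enhancement is meant to address.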


Key words

Chinese-Burmese bilingual / seed dictionary / iterative self-learning / pre-trained language model / contextual feature / semi-supervised

Cite This Article

MAO Cunli, LU Shan, WANG Hongbin, YU Zhengtao, WU Xia, WANG Zhenhan. Semi-supervised Chinese-Burmese Bilingual Dictionary Construction. Journal of Chinese Information Processing, 2021, 35(7): 47-53.


Funding

National Natural Science Foundation of China (61732005, 61662041, 61761026, 61866019, 61972186); Key Project of the Yunnan Applied Basic Research Program (2019FA023); Yunnan Reserve Talent Program for Young and Middle-aged Academic and Technical Leaders (2019HB006)