A Chinese-Lao Bilingual Sentence Similarity Method Incorporating Sentence Structure Features

LI Xuanda, ZHOU Lanjiang, ZHANG Jian'an

Journal of Chinese Information Processing, 2022, Vol. 36, Issue 2: 58-68.
Information Processing for Ethnic, Cross-Border, and Neighboring Languages


Sentence Similarity Metric Between Chinese and Laotian Based on Syntax Feature

  • LI Xuanda, ZHOU Lanjiang, ZHANG Jian'an

Abstract

Bilingual parallel sentence pairs are an important data resource for low-resource neural machine translation, and incorporating language structure features helps address the inaccurate sentence similarity scores that arise from differences between languages. This paper proposes a Chinese-Lao bilingual sentence similarity method that incorporates sentence structure features. First, the sentence structure features of Chinese and Lao are obtained with the feature template proposed in this paper. Second, Chinese-Lao bilingual word embeddings carrying these sentence structure features are pre-trained and mapped into a shared semantic space using a bilingual dictionary. Third, sentence feature vectors are obtained through a bidirectional long short-term memory (BiLSTM) network with a self-attention mechanism. Finally, the relative difference and relative product of the two sentence vectors are computed, concatenated, and passed to a fully connected layer that outputs the similarity score. Experimental results show that, compared with current mainstream methods, the proposed method achieves better results with a limited corpus (F1 = 70.24%).
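
The final scoring step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the "relative difference" is the element-wise absolute difference and the "relative product" is the element-wise product of the two sentence vectors, and it assumes a single tanh hidden layer followed by a sigmoid output. The function name `similarity_score`, the parameter names, the layer sizes, and the random weights are all hypothetical; the sentence vectors `u` and `v` stand in for the BiLSTM outputs with self-attention.

```python
import numpy as np


def similarity_score(u, v, W1, b1, w2, b2):
    """Score a sentence pair from its two sentence vectors (illustrative only)."""
    diff = np.abs(u - v)  # assumed form of the "relative difference"
    prod = u * v          # assumed form of the "relative product"
    features = np.concatenate([diff, prod])  # concatenated pair features
    hidden = np.tanh(W1 @ features + b1)     # assumed fully connected hidden layer
    logit = w2 @ hidden + b2
    return 1.0 / (1.0 + np.exp(-logit))      # squash to a score in (0, 1)


# Hypothetical sizes and random, untrained weights, for demonstration only.
rng = np.random.default_rng(0)
d, h = 8, 4
u = rng.normal(size=d)  # stand-in for the Chinese sentence vector
v = rng.normal(size=d)  # stand-in for the Lao sentence vector
W1 = rng.normal(scale=0.1, size=(h, 2 * d))
b1 = np.zeros(h)
w2 = rng.normal(scale=0.1, size=h)
b2 = 0.0

score = similarity_score(u, v, W1, b1, w2, b2)
print(f"similarity score: {score:.4f}")
```

Under these assumptions both features are symmetric in `u` and `v`, so the score does not depend on which language is passed first.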

Key words

Chinese-Laotian / resource-scarce language / sentence structure features / bidirectional long short-term memory network (BiLSTM) / self-attention mechanism

Cite this article

LI Xuanda, ZHOU Lanjiang, ZHANG Jian'an. Sentence Similarity Metric Between Chinese and Laotian Based on Syntax Feature. Journal of Chinese Information Processing. 2022, 36(2): 58-68

Funding

National Natural Science Foundation of China (61662040)