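Sentence Similarity Measure Between Chinese and Lao Based on the Syntax Structure

LEI Xin, ZHOU Leiyue, ZHOU Lanjiang

Journal of Chinese Information Processing, 2023, Vol. 37, Issue 9: 73-82.
Information Processing for Ethnic, Cross-Border, and Neighboring Languages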


Abstract

Cross-lingual sentence similarity calculation is one of the core tasks of natural language processing. Markers are an important linguistic feature of Lao, and certain structures in Chinese can serve the same marking function; tense and attributives are widespread in both languages. Identifying tense and locating attributives, and incorporating these linguistic features, can effectively improve the accuracy of sentence similarity calculation. This paper proposes a sentence similarity calculation method that fuses grammatical and structural features. After feature tags are added, a CNN and a BiGRU are used to obtain bilingual sentence representations that carry more semantic information. An interaction-and-aggregation structure based on local inference is then attached so that information from the two languages can interact. The relative difference and relative product of the resulting sequences are computed, concatenated, and fed into a fully connected layer to obtain the Chinese-Lao sentence similarity score. Experimental results show that the method performs strongly against current mainstream methods, reaching an F1 score of 77.67%.
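The final matching step described in the abstract can be illustrated with a minimal sketch. It assumes that the "relative difference" is the element-wise absolute difference and the "relative product" is the element-wise product of two fixed-size sentence vectors; the abstract does not give the exact definitions. The function names `match_features` and `similarity_score`, the single sigmoid output unit, and the example weights are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def match_features(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Concatenate element-wise |u - v| and u * v for two sentence vectors."""
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same dimension")
    # Assumed form of the "relative difference": element-wise absolute difference.
    diff = [abs(a - b) for a, b in zip(u, v)]
    # Assumed form of the "relative product": element-wise product.
    prod = [a * b for a, b in zip(u, v)]
    return diff + prod


def similarity_score(
    u: Sequence[float],
    v: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Apply one fully connected unit with a sigmoid to the matching features."""
    feats = match_features(u, v)
    if len(weights) != len(feats):
        raise ValueError("weights must have one entry per feature")
    logit = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Toy vectors standing in for the pooled Chinese and Lao representations.
    zh_vec = [0.2, -0.5, 0.9]
    lo_vec = [0.1, -0.4, 1.0]
    # Hypothetical weights: penalise differences, reward agreement.
    w = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
    print(match_features(zh_vec, lo_vec))
    print(similarity_score(zh_vec, lo_vec, w))
```

In a trained model the weights would be learned jointly with the CNN, BiGRU, and local-inference layers; here they are fixed by hand only to show how the concatenated features reach a single score.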

Keywords

Lao / sentence similarity / convolutional neural network (CNN) / bidirectional gated recurrent unit (BiGRU) / local inference

Cite this article

LEI Xin, ZHOU Leiyue, ZHOU Lanjiang. Sentence Similarity Measure Between Chinese and Lao Based on the Syntax Structure. Journal of Chinese Information Processing. 2023, 37(9): 73-82

Funding

National Natural Science Foundation of China (61662040, 62166023)