Abstract
Cross-language sentence similarity calculation is one of the core tasks of natural language processing. Markers are an important linguistic feature of Lao, and certain structures in Chinese can play the same marking role; tense and attributives are widespread in both languages. Distinguishing tense, locating attributives, and incorporating these linguistic features can effectively improve the accuracy of sentence similarity calculation. This paper proposes a Chinese-Lao sentence similarity method that fuses grammatical and structural features. After feature tags are added, a CNN and a BiGRU produce semantic representations of the bilingual sentences that carry richer semantic information. A local-inference interaction and aggregation structure then lets the information from the two languages interact. The relative difference and relative product of the interacted sequences are computed, concatenated, and fed to a fully connected layer to obtain the Chinese-Lao sentence similarity score. Experimental results show that the method performs strongly against current mainstream methods, reaching an F1 score of 77.67%.
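The interaction and matching steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes dot-product soft attention for the local-inference interaction, mean pooling for aggregation, and reads "relative difference" and "relative product" as the element-wise absolute difference and element-wise product. The function names (`soft_align`, `mean_pool`, `match_features`, `similarity_score`), the toy vectors, and the fixed weights are all hypothetical; in the paper the inputs would come from the CNN and BiGRU encoders and the weights would be learned.

```python
import math

def soft_align(a, b):
    """For each vector in a, return an attention-weighted sum of the vectors in b.

    Assumption: dot-product attention with a softmax over b.
    """
    dim = len(b[0])
    aligned = []
    for u in a:
        scores = [sum(x * y for x, y in zip(u, v)) for v in b]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        aligned.append(
            [sum(w / total * v[k] for w, v in zip(weights, b)) for k in range(dim)]
        )
    return aligned

def mean_pool(seq):
    """Aggregate a sequence of vectors into one vector (assumed pooling choice)."""
    n = len(seq)
    return [sum(v[k] for v in seq) / n for k in range(len(seq[0]))]

def match_features(p, q):
    """Concatenate element-wise absolute difference and element-wise product."""
    diff = [abs(x - y) for x, y in zip(p, q)]
    prod = [x * y for x, y in zip(p, q)]
    return diff + prod

def similarity_score(features, weights, bias):
    """A single fully connected unit with a sigmoid, giving a score in (0, 1)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy stand-ins for encoder outputs: 2 Chinese tokens, 3 Lao tokens, dimension 2.
zh = [[1.0, 0.0], [0.0, 1.0]]
lo = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

zh_ctx = mean_pool(soft_align(zh, lo))  # Chinese side after attending to Lao
lo_ctx = mean_pool(soft_align(lo, zh))  # Lao side after attending to Chinese
feats = match_features(zh_ctx, lo_ctx)
score = similarity_score(feats, [0.1] * len(feats), 0.0)  # untrained, fixed weights
print(len(feats), score)
```

Concatenating the difference and the product gives the final layer both a distance-like signal and an agreement-like signal for each dimension, which is a common design in sentence-pair models.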
Key words
Lao /
sentence similarity /
CNN /
BiGRU /
local inference
Funding
National Natural Science Foundation of China (61662040, 62166023)