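Sentence Similarity Computation between Chinese and Lao with Part-of-Speech and Syntactic Position Information

GUO Lei, ZHOU Lanjiang, ZHOU Leiyue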

Journal of Chinese Information Processing ›› 2023, Vol. 37 ›› Issue (12): 76-86.
Information Processing for Ethnic, Cross-Border, and Neighboring Languages

Abstract

Chinese and Lao differ considerably in word order, and incorporating the positional features of parts of speech such as nouns, adjectives, classifiers, and numerals into Chinese-Lao parallel sentence pairs can effectively improve the accuracy of sentence similarity measurement. This paper proposes a similarity calculation method for Chinese and Lao based on part-of-speech and syntactic position features. First, feature word tags and feature part-of-speech tags are inserted into the Chinese-Lao bilingual sentences before word embeddings are derived, so that the distributed representation of each sentence carries richer semantic information. Then, for each language, the word embeddings are fed into three gated linear convolutional networks (GCNs) with different convolution kernel sizes and into a bidirectional long short-term memory network (BiLSTM) to extract deep semantic information. A self-attention mechanism is applied separately to the BiLSTM outputs and to the concatenation of the three GCN outputs, and the two results are concatenated to form the final sentence representation. Finally, the relative difference and the relative product of the two sentence representations are computed, concatenated, and fed into a fully connected layer to obtain the similarity score of the Chinese-Lao sentence pair. Experimental results show that the proposed method achieves better results with a limited corpus, reaching an F1 score of 77.19%.
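The final scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the "relative difference" is the element-wise absolute difference, that the "relative product" is the element-wise product, and that the fully connected layer is a single linear layer followed by a sigmoid. The function names, the toy vectors, and the weights are all hypothetical.

```python
import math
from typing import List, Sequence


def relative_difference(u: Sequence[float], v: Sequence[float]) -> List[float]:
    # Assumption: "relative difference" = element-wise absolute difference.
    return [abs(a - b) for a, b in zip(u, v)]


def relative_product(u: Sequence[float], v: Sequence[float]) -> List[float]:
    # Assumption: "relative product" = element-wise product.
    return [a * b for a, b in zip(u, v)]


def similarity_score(
    u: Sequence[float],
    v: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Concatenate both feature vectors and apply one linear layer with a sigmoid."""
    if len(u) != len(v):
        raise ValueError("sentence representations must have the same length")
    features = relative_difference(u, v) + relative_product(u, v)
    if len(weights) != len(features):
        raise ValueError("weights must match the concatenated feature length")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sentence representations (hypothetical values, not taken from the paper).
zh_vec = [0.5, -1.0, 2.0]
lo_vec = [0.5, 1.0, 1.0]
# Hypothetical layer weights: penalize differences, reward agreement.
weights = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]

score = similarity_score(zh_vec, lo_vec, weights)
print(f"similarity score: {score:.3f}")
```

In a trained model the weights would be learned from labeled sentence pairs, and the input vectors would come from the concatenated GCN and BiLSTM outputs rather than being set by hand.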

Keywords

Chinese-Lao / part-of-speech syntactic position features / gated linear convolutional network (GCN) / bidirectional long short-term memory network (BiLSTM) / self-attention mechanism

Cite this article

GUO Lei, ZHOU Lanjiang, ZHOU Leiyue. Sentence Similarity Computation between Chinese and Lao with Part-of-Speech and Syntactic Position Information. Journal of Chinese Information Processing. 2023, 37(12): 76-86

Funding

National Natural Science Foundation of China (61662040)