Abstract
Bilingual sentence alignment supplies high-quality parallel corpora for cross-lingual tasks such as machine translation and information retrieval, and it is especially important for natural language processing research on Lao, a low-resource language. Chinese-Lao bilingual texts contain non-monotonic alignments (cross alignments and null alignments), which degrade alignment quality. In addition, person names and place names, which are key elements of news text, are mostly out-of-vocabulary words and make Chinese-Lao sentence alignment harder still. This paper proposes a Chinese-Lao bilingual sentence alignment method that fuses local and global semantic information. First, Chinese and Lao sentence-length features and person-name and place-name features are integrated into GloVe word vectors, and a bidirectional gated recurrent unit (BiGRU) encodes the resulting feature vectors to obtain finer-grained local sentence information. Second, an interactive attention mechanism extracts global information from the bilingual sentences so that contextual semantic features are used effectively. Finally, the KM algorithm is introduced on top of a multilayer perceptron, which allows the method to handle non-monotonically aligned text and improves the generalization ability of the model. Experimental results show that the method significantly improves alignment performance on Chinese-Lao bilingual news corpora.
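The final step, applying the KM (Kuhn-Munkres) algorithm to handle cross and null alignments, can be illustrated with a minimal sketch. It assumes that some upstream scorer (in the paper, the multilayer perceptron) has already produced a matrix of alignment scores between Chinese and Lao sentences. The function names `km_align` and `_hungarian_min`, the example scores, and the `threshold` used to decide null alignments are all hypothetical and are not taken from the paper.

```python
def _hungarian_min(cost):
    """Minimum-cost assignment for an n x m matrix with n <= m.

    Returns (row, col) pairs; every row is assigned to a distinct column.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j] = row matched to column j (0 = none)
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached a free column: path is complete
                break
        while j0 != 0:           # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0]


def km_align(scores, threshold=0.5):
    """Pick the one-to-one alignment with the highest total score.

    scores[i][j] is the (assumed) alignment score between source sentence i
    and target sentence j. Pairs scoring below `threshold` are dropped, so
    the sentences involved are left unaligned (a null alignment).
    """
    n, m = len(scores), len(scores[0])
    transposed = n > m
    if transposed:               # the solver needs rows <= columns
        scores = [list(col) for col in zip(*scores)]
    cost = [[-s for s in row] for row in scores]   # maximise via negation
    pairs = _hungarian_min(cost)
    kept = [(r, c) for r, c in pairs if scores[r][c] >= threshold]
    if transposed:
        kept = [(c, r) for r, c in kept]
    return sorted(kept)


# Sentences 0 and 1 are swapped between the two texts (cross alignment),
# and sentence 2 has no good counterpart (null alignment).
example = [
    [0.1, 0.9, 0.2],
    [0.8, 0.2, 0.1],
    [0.2, 0.1, 0.3],
]
print(km_align(example))  # [(0, 1), (1, 0)]
```

Because the assignment is solved globally instead of left to right, the crossing pair is recovered, and the threshold leaves the third sentence unaligned instead of forcing a poor match.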
Key words
Chinese-Lao bilingual sentence alignment /
semantic information /
BiGRU /
attention mechanism