To address the scarcity of Tibetan corpora, this paper applies data augmentation to the Tibetan-Chinese bilingual and Tibetan monolingual paraphrase detection tasks, partially alleviating the small scale of training corpora for low-resource languages. In the Tibetan-Chinese cross-lingual paraphrase detection task, the data augmentation method makes effective use of the currently available public Tibetan-Chinese parallel corpora to expand the training corpus. When the corpus is expanded to 200,000 sentence pairs, the Tibetan-Chinese paraphrase detection model achieves a Pearson correlation of 0.547 6, an improvement of 0.397 1 over the baseline system, indicating that the sentence-pair similarity scores produced by the model are moderately correlated with the human-annotated similarity scores. In the Tibetan monolingual task, this paper trains Tibetan syllable vectors to mitigate the word-vector sparsity caused by corpus scarcity. Experimental results show that the paraphrase detection model based on Tibetan syllable vectors achieves a Pearson correlation of 0.678 0, which is 0.1 higher than the corresponding experiment based on Tibetan word vectors, bringing the detection results of the Tibetan monolingual paraphrase detection model to a strong correlation with the human annotations.
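The syllable-vector approach above rests on a convenient property of Tibetan orthography: syllables are delimited by the tsheg mark (U+0F0B). The sketch below (illustrative code, not the paper's implementation; the function name `tibetan_syllables` is an assumption) shows how such syllable tokens could be extracted before training syllable embeddings.

```python
def tibetan_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllable tokens.

    Tibetan syllables are separated by the tsheg mark (U+0F0B);
    the shad (U+0F0D) is a sentence-level delimiter, so we treat
    it as a boundary as well, then drop empty fragments.
    """
    TSHEG = "\u0f0b"
    SHAD = "\u0f0d"
    text = text.replace(SHAD, TSHEG)
    return [s for s in text.split(TSHEG) if s]

# Example: "bod kyi skad" (the Tibetan language), three syllables
print(tibetan_syllables("བོད་ཀྱི་སྐད།"))  # ['བོད', 'ཀྱི', 'སྐད']
```

Each resulting syllable token would then play the role a word plays in ordinary embedding training, giving a much smaller and denser vocabulary than word-level segmentation.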
Abstract
To alleviate the scarcity of Tibetan corpora, this paper proposes a data augmentation method for the Tibetan-Chinese bilingual and Tibetan monolingual paraphrase detection tasks. In the Tibetan-Chinese bilingual paraphrase detection task, the available parallel corpora are augmented with Tibetan monolingual texts. When the training set is expanded to 200,000 pairs, the Pearson coefficient reaches 0.547 6, an improvement of 0.397 1 over the baseline system. In the Tibetan monolingual paraphrase detection task, Tibetan syllable vectors are adopted to alleviate the word-vector sparsity caused by the insufficient training corpus. Experimental results show that the Pearson correlation of the Tibetan-syllable-vector-based experiment reaches 0.678 0, which is 0.1 higher than the corresponding word-vector-based method.
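The Pearson coefficient reported above measures the linear correlation between the model's predicted similarity scores and the human-annotated scores. A minimal sketch of the metric (the function name `pearson` is illustrative; in practice one would typically call `scipy.stats.pearsonr`):

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between model similarity scores (xs)
    and human-annotated similarity scores (ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related scores give r = 1.0
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
```

On the common interpretation scale used in the abstract, values around 0.5 indicate moderate correlation and values around 0.7 indicate strong correlation with the human annotations.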
Key words
paraphrase detection /
data augmentation /
siamese network /
low-resource language
Funding
Key Project of the State Language Commission (ZDl135-39); Sub-project of the National Key Research and Development Program of China (2017YFB1002103-1); National Social Science Foundation of China (17CYY044)