To address the scarcity of Tibetan corpora, this paper applies data augmentation to the Tibetan-Chinese bilingual and Tibetan monolingual paraphrase detection tasks, partially alleviating the small scale of training corpora for low-resource languages. In the Tibetan-Chinese cross-lingual paraphrase detection task, the data augmentation method makes effective use of the currently available public Tibetan-Chinese parallel corpora to expand the training corpus. When the corpus is expanded to 200,000 sentence pairs, the Tibetan-Chinese paraphrase detection model achieves a Pearson correlation of 0.547 6, an improvement of 0.397 1 over the baseline system, indicating that the sentence-pair similarity scores produced by the model are moderately correlated with the human-annotated similarity scores. In the Tibetan monolingual task, this paper trains Tibetan syllable vectors to mitigate the word-vector sparsity caused by corpus scarcity. Experimental results show that the paraphrase detection model based on Tibetan syllable vectors achieves a Pearson correlation of 0.678 0, which is 0.1 higher than the corresponding experiment based on Tibetan word vectors, bringing the detection results of the Tibetan monolingual paraphrase detection model to a strong correlation with the human annotations.
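The syllable-vector approach above rests on a convenient property of Tibetan orthography: syllables are delimited by the tsheg mark (U+0F0B). The sketch below (illustrative code, not the paper's implementation; the function name `tibetan_syllables` is an assumption) shows how such syllable tokens could be extracted before training syllable embeddings.

```python
def tibetan_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllable tokens.

    Tibetan syllables are separated by the tsheg mark (U+0F0B);
    the shad (U+0F0D) is a sentence-level delimiter, so we treat
    it as a boundary as well, then drop empty fragments.
    """
    TSHEG = "\u0f0b"
    SHAD = "\u0f0d"
    text = text.replace(SHAD, TSHEG)
    return [s for s in text.split(TSHEG) if s]

# Example: "bod kyi skad" (the Tibetan language), three syllables
print(tibetan_syllables("བོད་ཀྱི་སྐད།"))  # ['བོད', 'ཀྱི', 'སྐད']
```

Each resulting syllable token would then play the role a word plays in ordinary embedding training, giving a much smaller and denser vocabulary than word-level segmentation.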
Abstract
To alleviate the scarcity of Tibetan corpora, this paper proposes a data augmentation method for the Tibetan-Chinese bilingual and Tibetan monolingual paraphrase detection tasks. In the Tibetan-Chinese bilingual paraphrase detection task, the available parallel corpora are augmented with Tibetan monolingual texts. When the training set is expanded to 200,000 pairs, the Pearson coefficient reaches 0.547 6, an improvement of 0.397 1 over the baseline system. In the Tibetan monolingual paraphrase detection task, Tibetan syllable vectors are adopted to alleviate the word-vector sparsity caused by the insufficient training corpus. Experimental results show that the Pearson correlation of the Tibetan-syllable-vector-based experiment reaches 0.678 0, which is 0.1 higher than the corresponding word-vector-based method.
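The Pearson coefficient reported above measures the linear correlation between the model's predicted similarity scores and the human-annotated scores. A minimal sketch of the metric (the function name `pearson` is illustrative; in practice one would typically call `scipy.stats.pearsonr`):

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between model similarity scores (xs)
    and human-annotated similarity scores (ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related scores give r = 1.0
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
```

On the common interpretation scale used in the abstract, values around 0.5 indicate moderate correlation and values around 0.7 indicate strong correlation with the human annotations.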
Key words
paraphrase detection /
data augmentation /
siamese network /
low-resource language
Funding
Key Project of the State Language Commission (ZDl135-39); Sub-project of the National Key Research and Development Program of China (2017YFB1002103-1); National Social Science Foundation of China (17CYY044)