Abstract
Reading comprehension question answering systems use semantic understanding and other natural language processing techniques to analyze unstructured documents and generate an answer to a given question, and they have high research and application value. In vertical-domain applications, annotating reading comprehension data is costly and user questions are expressed in complex and varied ways, so reading comprehension systems suffer from low accuracy and poor robustness. To address this problem, this paper proposes a data augmentation method for vertical-domain reading comprehension question answering that constructs training data from real user questions, which both lowers annotation cost and increases the diversity of the training data, thereby improving model accuracy and robustness. Experiments on automobile-domain data show that the method effectively improves both the accuracy and the robustness of the reading comprehension model.
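To make the core idea concrete, the following is a minimal, self-contained Python sketch of one possible way to construct span-extraction training examples from real user questions: each question is paired with its most similar passage and labeled with an answer span found by string matching. The retrieval heuristic (token overlap), the candidate-answer matching, and all identifiers (`build_examples`, `RCExample`, the 0.2 threshold) are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: build SQuAD-style reading-comprehension training
# examples by pairing real user questions with document passages.
# The matching heuristics below are illustrative assumptions only.

from dataclasses import dataclass


@dataclass
class RCExample:
    question: str
    context: str
    answer_text: str
    answer_start: int  # character offset of the answer span in the context


def jaccard(a: set, b: set) -> float:
    """Token-set Jaccard similarity, used as a stand-in retrieval score."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def build_examples(user_questions, passages, answer_candidates, threshold=0.2):
    """Pair each real user question with its most similar passage and,
    if a known answer candidate occurs in that passage, emit a
    span-labeled training example."""
    examples = []
    for q in user_questions:
        q_tokens = set(q.lower().split())
        # Retrieve the most lexically similar passage (illustrative heuristic).
        best = max(passages, key=lambda p: jaccard(q_tokens, set(p.lower().split())))
        if jaccard(q_tokens, set(best.lower().split())) < threshold:
            continue  # no sufficiently related passage; skip this question
        for ans in answer_candidates:
            start = best.find(ans)
            if start != -1:
                examples.append(RCExample(q, best, ans, start))
                break
    return examples


if __name__ == "__main__":
    questions = ["How often should the engine oil be changed?"]
    passages = ["The engine oil should be changed every 5,000 miles under normal driving conditions."]
    answers = ["every 5,000 miles"]
    for ex in build_examples(questions, passages, answers):
        print(ex)
```

In practice a stronger retriever (e.g., sentence embeddings) and a learned answer locator would replace the lexical heuristics shown here; the sketch only illustrates how real user questions can be turned into span-annotated training data without manual annotation.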
Key words
reading comprehension /
data augmentation /
question answering system