Generating SQL Statement from Chinese Query Based on Dual Learning

ZHAO Zhichao, YOU Jinguo, HE Peilei, LI Xiaowu

Journal of Chinese Information Processing, 2023, Vol. 37, Issue 3: 164-172.
Natural Language Understanding and Generation

Abstract

Supervised approaches to Chinese NL2SQL (natural language to SQL) require large amounts of annotated data. To address this, this paper proposes DualSQL, a dual learning NL2SQL model that performs weakly supervised learning on a small amount of labeled training data to generate SQL statements from Chinese queries. Two tasks are trained simultaneously as a dual pair: translating natural language into SQL, and translating SQL back into natural language. This lets the model learn the dual constraints between the two tasks and capture more relevant semantic information. To verify the effectiveness of dual learning on the NL2SQL parsing task, training is carried out with different proportions of unlabeled data. Experiments on the Chinese and English datasets ATIS, GEO, and TableQA show that, compared with baseline models such as Seq2Seq, Seq2Tree, Seq2SQL, SQLNet, and -dual, the proposed model improves accuracy by at least 2.1%, and improves execution accuracy on the Chinese TableQA dataset by at least 5.3%. Furthermore, with only 60% of the labeled data, the model achieves results similar to those of supervised learning with 90% of the labeled data.
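
Execution accuracy, the metric reported for TableQA above, is commonly computed by running both the predicted and the reference SQL against the database and checking whether the returned results match. The block below is a minimal sketch of that idea, not the paper's evaluation script: it assumes SQLite as the database engine, and the `sales` table, its rows, and the helper names `run_query`, `execution_accuracy`, and `build_demo_db` are all invented for illustration.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run sql and return its rows as a multiset, or None if execution fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Row order is ignored; duplicate rows are still counted.
    return Counter(rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match.

    A pair counts as correct only if both queries execute and return
    the same multiset of rows. An empty list of pairs scores 0.0.
    """
    if not pairs:
        return 0.0
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        if gold is not None and predicted == gold:
            correct += 1
    return correct / len(pairs)


def build_demo_db():
    """Create an in-memory database with a small, made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (item TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("tea", 120), ("rice", 40)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo_conn = build_demo_db()
    demo_pairs = [
        # Different SQL text, same result: counted as correct.
        ("SELECT item FROM sales WHERE amount > 100",
         "SELECT item FROM sales WHERE amount >= 120"),
        # Executes, but returns extra rows: counted as wrong.
        ("SELECT item FROM sales",
         "SELECT item FROM sales WHERE amount < 100"),
        # Malformed prediction: counted as wrong.
        ("SELEC item FROM sales",
         "SELECT item FROM sales"),
    ]
    score = execution_accuracy(demo_conn, demo_pairs)
    print(f"execution accuracy: {score:.3f}")
```

Because the comparison is made on query results rather than on SQL text, two differently worded queries that return the same rows are both counted as correct, which is what separates this metric from exact-match accuracy.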

Key words

NL2SQL / dual learning / semantic parsing / semi-supervised learning

Cite this article

ZHAO Zhichao, YOU Jinguo, HE Peilei, LI Xiaowu. Generating SQL Statement from Chinese Query Based on Dual Learning. Journal of Chinese Information Processing. 2023, 37(3): 164-172

Funding

National Natural Science Foundation of China (62062046)