Lexicalized Tree Adjoining Grammar Based Data Augmentation for Parsing

CHEN Hongbin, ZHANG Yujie, XU Jin'an, CHEN Yufeng

Journal of Chinese Information Processing ›› 2022, Vol. 36 ›› Issue (10): 27-37, 44.
Language Analysis and Computation


Abstract

Parsing is a fundamental technology in natural language processing. Mainstream data-driven neural parsing models require large-scale annotated data, but extending a treebank through manual annotation is costly, so how to perform data augmentation on an existing annotated treebank has become a research focus. In the data augmentation task for Chinese parsing, the sentences generated from a given annotated treebank must satisfy two requirements: first, they must have diverse and complete syntactic tree structures; second, they must be semantically plausible. To this end, we propose, for the first time, a data augmentation method based on lexicalized tree adjoining grammar (LTAG). For the first requirement, we design and implement an LTAG-based lexicalized tree extraction algorithm and a syntactic tree synthesis algorithm: under this grammar, "adjoining" and "substitution" operations can be applied between syntactic trees to derive new trees, and linguistic knowledge guarantees that the generated sentences are grammatical and have complete syntactic tree structures. For the second requirement, we use a language model to assess the semantic plausibility of the generated sentences and select the plausible ones as the final augmented data, thereby obtaining a high-quality annotated treebank. Taking Chinese as a case study, we conduct data augmentation evaluation experiments for parsing on the Chinese treebank CTB5. In the small-sample experiment (20% of CTB5), the augmented data produced by our method improves the accuracy of dependency parsing and constituency parsing by 1.39% and 2.14%, respectively. In the robustness experiment, we build an extended test set for evaluation; on this extended test set, the augmented data improves dependency and constituency parsing accuracy by 1.43% and 0.44%, respectively, demonstrating better robustness.
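The abstract describes deriving new parse trees from extracted lexicalized elementary trees via LTAG's two composition operations. As a rough illustration of those operations only (the `Node` class and all function names are invented for this sketch, not taken from the paper's implementation):

```python
# Illustrative sketch of the two LTAG composition operations the method
# relies on: "substitution" fills an argument slot with an initial tree,
# "adjoining" splices an auxiliary tree into an internal node.

class Node:
    def __init__(self, label, children=None, subst=False, foot=False):
        self.label = label              # category ("NP") or word ("喜欢")
        self.children = children or []
        self.subst = subst              # substitution site, drawn as NP↓
        self.foot = foot                # foot node of an auxiliary tree, VP*

    def clone(self):
        return Node(self.label, [c.clone() for c in self.children],
                    self.subst, self.foot)

    def leaves(self):
        return [self] if not self.children else \
               [l for c in self.children for l in c.leaves()]

def substitute(tree, label, initial):
    """Replace the first substitution site of category `label`
    with a copy of an initial tree rooted in that category."""
    for i, c in enumerate(tree.children):
        if c.subst and c.label == label:
            tree.children[i] = initial.clone()
            return True
        if substitute(c, label, initial):
            return True
    return False

def _replace_foot(node, subtree):
    for i, c in enumerate(node.children):
        if c.foot:
            node.children[i] = subtree
            return True
        if _replace_foot(c, subtree):
            return True
    return False

def adjoin(tree, label, aux):
    """Splice a copy of the auxiliary tree into the first internal
    node of category `label`; the excised subtree re-attaches at
    the auxiliary tree's foot node."""
    for i, c in enumerate(tree.children):
        if c.children and c.label == label:
            new = aux.clone()
            _replace_foot(new, c)
            tree.children[i] = new
            return True
        if adjoin(c, label, aux):
            return True
    return False

def sentence(tree):
    return " ".join(l.label for l in tree.leaves())

# Elementary tree for "喜欢" with two NP substitution sites, two
# initial NP trees, and one adverbial auxiliary tree.
t = Node("S", [Node("NP", subst=True),
               Node("VP", [Node("V", [Node("喜欢")]),
                           Node("NP", subst=True)])])
substitute(t, "NP", Node("NP", [Node("他")]))
substitute(t, "NP", Node("NP", [Node("音乐")]))
adjoin(t, "VP", Node("VP", [Node("AD", [Node("非常")]),
                            Node("VP", foot=True)]))
print(sentence(t))   # 他 非常 喜欢 音乐
```

Each derived tree is a complete parse, so its yield comes with a full syntactic annotation for free, which is what makes the generated sentences usable as treebank data.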

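For the second requirement, the abstract only says that a language model scores the derived sentences for semantic plausibility. A toy stand-in for that filter (an add-one-smoothed bigram LM rather than whatever neural LM the paper actually uses; all names here are invented) might rank and select candidates like this:

```python
# Toy illustration of the plausibility filter: score each generated
# sentence with a language model and keep only the best-scoring ones.
import math
from collections import Counter

class BigramLM:
    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in corpus:
            toks = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab = len(self.unigrams)

    def logprob(self, sent):
        """Average per-token log-probability with add-one smoothing."""
        toks = ["<s>"] + sent + ["</s>"]
        lp = 0.0
        for a, b in zip(toks, toks[1:]):
            lp += math.log((self.bigrams[(a, b)] + 1) /
                           (self.unigrams[a] + self.vocab))
        return lp / (len(toks) - 1)

def filter_generated(lm, candidates, keep_ratio=0.5):
    """Rank candidate sentences by LM score; keep the top fraction."""
    ranked = sorted(candidates, key=lm.logprob, reverse=True)
    return ranked[:max(1, int(len(ranked) * keep_ratio))]

lm = BigramLM([["他", "喜欢", "音乐"],
               ["她", "喜欢", "电影"],
               ["他", "讨厌", "噪音"]])
kept = filter_generated(lm, [["他", "喜欢", "电影"],     # fluent
                             ["电影", "他", "喜欢"]])    # scrambled
print(kept)   # [['他', '喜欢', '电影']]
```

Averaging the log-probability over tokens (rather than summing) keeps the score comparable across derived sentences of different lengths, which matters because adjoining can grow sentences arbitrarily.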

Keywords

dependency parsing / constituency parsing / lexicalized tree adjoining grammar / language model / data augmentation


Cite This Article

CHEN Hongbin, ZHANG Yujie, XU Jin'an, CHEN Yufeng. Lexicalized Tree Adjoining Grammar Based Data Augmentation for Parsing. Journal of Chinese Information Processing. 2022, 36(10): 27-37,44


Funding

National Natural Science Foundation of China (61876198, 61976015, 61976016)