Incorporating Confusion Set Knowledge with Pointer Network for Chinese Grammatical Error Correction

LI Jiacheng, SHEN Jiayu, GONG Chen, LI Zhenghua, ZHANG Min

Journal of Chinese Information Processing, 2022, Vol. 36, Issue 4: 29-38.
Section: Language Analysis and Computing

Abstract

In Chinese grammatical error correction (CGEC), substitution errors account for the largest share of errors in the datasets, yet no researcher has tried to incorporate phonological and visual similarity knowledge into neural GEC models. This paper addresses the gap in two ways. First, it proposes a GEC model that incorporates confusion set knowledge through a pointer network. Specifically, the model builds on a sequence-to-edit (Seq2Edit) GEC model and uses the pointer network to incorporate phonological and visual similarity knowledge between Chinese characters. Second, in the training data preprocessing stage, i.e., when extracting edit sequences from erroneous-correct sentence pairs, it proposes a confusion-set-guided edit distance algorithm that better extracts substitution edits between phonologically and visually similar characters. Experimental results show that both improvements raise model performance and that their contributions are complementary; the proposed model achieves state-of-the-art performance on the NLPCC 2018 evaluation dataset. Further analysis shows that, compared with the baseline Seq2Edit GEC model, most of the performance gain comes from the correction of substitution errors.
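
The abstract does not spell out the confusion-set-guided edit distance, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes a standard Levenshtein alignment in which substituting two characters that appear in each other's confusion set costs less than an ordinary substitution. The function name `extract_edits`, the `similar_cost` value of 0.5, the edit labels, and the toy confusion set are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (operation, source character, target character); names are illustrative only.
Edit = Tuple[str, str, str]


def extract_edits(src: str, tgt: str, confusion: Dict[str, Set[str]],
                  similar_cost: float = 0.5) -> List[Edit]:
    """Align src to tgt with an edit distance that favours confusable pairs.

    Assumption: a substitution between characters listed in the confusion
    set costs `similar_cost` (< 1); every other edit costs 1.
    """
    def sub_cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        if b in confusion.get(a, ()) or a in confusion.get(b, ()):
            return similar_cost
        return 1.0

    n, m = len(src), len(tgt)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),
                dist[i - 1][j] + 1.0,   # delete src[i-1]
                dist[i][j - 1] + 1.0,   # insert tgt[j-1]
            )

    # Backtrace from the bottom-right cell, preferring the diagonal move.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1])):
            op = "KEEP" if src[i - 1] == tgt[j - 1] else "SUB"
            edits.append((op, src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1.0:
            edits.append(("DEL", src[i - 1], ""))
            i -= 1
        else:
            edits.append(("INS", "", tgt[j - 1]))
            j -= 1
    edits.reverse()
    return edits


if __name__ == "__main__":
    toy_confusion = {"青": {"晴", "清", "请"}}  # hypothetical confusion set
    src, tgt = "今天天气很青", "今天天气很晴朗"
    print(extract_edits(src, tgt, toy_confusion))  # guided alignment
    print(extract_edits(src, tgt, {}))             # plain Levenshtein alignment
```

In this toy pair, the plain alignment is free to pair 青 with 朗 and treat 晴 as an insertion, because both alignments cost the same. With the reduced cost, the alignment that substitutes 青 with 晴 becomes strictly cheaper, which is the kind of substitution edit the abstract says the guided algorithm is meant to recover.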

Keywords

grammatical error correction / confusion set / pointer network

Cite this article

LI Jiacheng, SHEN Jiayu, GONG Chen, LI Zhenghua, ZHANG Min. Incorporating Confusion Set Knowledge with Pointer Network for Chinese Grammatical Error Correction. Journal of Chinese Information Processing. 2022, 36(4): 29-38


Funding

National Natural Science Foundation of China (62176173, 61876116)