Abstract
Grammatical error detection and correction aims to automatically identify and correct grammatical errors in text. Unlike for Chinese, English, and other languages, work on Thai is constrained by limited data and remains confined to detecting and correcting errors with simple rules. Drawing on Thai linguistics and the characteristics of its error types, this paper uses manually designed heuristic rules and monolingual data to construct a sizeable Thai corpus for grammatical error detection and correction. Based on this corpus, the paper proposes a Thai grammatical error detection method that integrates linguistic features: deep semantic representations of characters, words, and parts of speech are fused into a multilingual BERT sequence labeling model. Experimental results show that the proposed method improves F1 by 1.37% over a baseline that relies on multilingual BERT alone, and that performance increases as the training data grows, which supports the effectiveness of the corpus construction method.
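The idea of building detection data from monolingual text with heuristic rules can be illustrated with a minimal sketch. It is not the paper's actual rule set: the three corruption operations, the label names (`O`, `R`, `M`, `W`), and the helper names `corrupt` and `make_example` are all hypothetical, and the input is assumed to be already segmented into words.

```python
import random

# Hypothetical label set (an assumption, not taken from the paper):
# "O" = correct, "R" = redundant token,
# "M" = a token is missing before this one, "W" = word-order error.


def corrupt(tokens, op, i):
    """Apply one corruption at index i; return (noisy_tokens, labels)."""
    noisy = list(tokens)
    labels = ["O"] * len(tokens)
    if op == "duplicate":
        # Repeat token i and mark the extra copy as redundant.
        noisy.insert(i + 1, tokens[i])
        labels.insert(i + 1, "R")
    elif op == "delete":
        # Drop token i; the token that follows carries the "M" label,
        # so i must not be the last index.
        del noisy[i]
        del labels[i]
        labels[i] = "M"
    elif op == "swap":
        # Exchange tokens i and i + 1 and mark both as misordered.
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
        labels[i] = labels[i + 1] = "W"
    else:
        raise ValueError(f"unknown operation: {op}")
    return noisy, labels


def make_example(tokens, rng):
    """Corrupt a clean sentence with one randomly chosen operation."""
    if len(tokens) < 2:
        return list(tokens), ["O"] * len(tokens)
    op = rng.choice(["duplicate", "delete", "swap"])
    i = rng.randrange(len(tokens) - 1)  # keeps i + 1 in range for every op
    return corrupt(tokens, op, i)


clean = ["ฉัน", "กิน", "ข้าว"]  # pre-segmented sentence, "I eat rice"
noisy, labels = make_example(clean, random.Random(0))
```

Each call yields a noisy token sequence with one label per token, which is the shape of input a sequence labeling model expects. Rules grounded in real Thai error patterns would replace the generic operations used here.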
Key words
text grammatical error detection /
Thai /
corpus /
feature fusion
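Funding
National Natural Science Foundation of China (62166022, 61732005); General Program of the Yunnan Provincial Department of Science and Technology (202101AT070077); Yunnan Province Talent Training Program (KKSY201903018)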