Research on a BiLSTM-CRF-Based Method for Recognizing Grouped Single-Character Errors in Chinese

CAO Yang, CAO Cungen, ZI Kangli, WANG Shi

Journal of Chinese Information Processing, 2023, Vol. 37, Issue (4): 156-165.
Natural Language Processing Applications


Chinese Typos Recognition by Character Grouping and BiLSTM-CRF

  • CAO Yang1,2, CAO Cungen1, ZI Kangli1,2, WANG Shi1

Abstract

Chinese automatic proofreading has made substantial progress over the past decade or so, but low precision and recall in recognizing single-character typos remain a major problem in the field. This paper proposes a method for recognizing Chinese typos that combines a BiLSTM-CRF neural network model with a single-character grouping strategy. First, it proposes a method for constructing grouped single-character confusion sets and automatically generates a typo recognition training corpus from the collected sets, yielding a training corpus that covers 13 groups of Chinese characters. Second, to address the low recall of traditional typo recognition methods on single-character typos, typos in the training corpus are annotated with a multi-label marking strategy. Third, to mitigate data sparsity in the training samples, four types of words in the training data (person names, place names, times, and organization names) are abstracted. Finally, a BiLSTM-CRF model is trained on the typo recognition training corpus. Experimental results show that the proposed method achieves an average recognition precision of 87.30% and an average recall of 84.36% over the 13 character groups.
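The corpus-generation and abstraction steps described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the confusion group `GROUP`, the tag scheme (`O` for an unchanged character, `E-<original>` for a substituted one), the placeholder strings such as `<PER>`, and the function names are hypothetical and are not taken from the paper, which does not spell out these details in the abstract.

```python
from typing import Dict, List, Tuple

# Hypothetical confusion group used only for illustration; the paper's
# actual 13 groups are not listed in the abstract.
GROUP: Tuple[str, ...] = ("的", "地", "得")


def make_typo_samples(
    sentence: str, group: Tuple[str, ...] = GROUP
) -> List[Tuple[str, List[str]]]:
    """Create typo sentences from a correct sentence.

    Each in-group character is swapped for every other member of the
    group, one substitution per generated sample. The tag scheme is an
    assumption: "O" marks an unchanged character and "E-<original>"
    marks the substituted position together with the correct character.
    """
    samples: List[Tuple[str, List[str]]] = []
    for i, ch in enumerate(sentence):
        if ch not in group:
            continue
        for wrong in group:
            if wrong == ch:
                continue
            corrupted = sentence[:i] + wrong + sentence[i + 1:]
            tags = ["O"] * len(sentence)
            tags[i] = "E-" + ch
            samples.append((corrupted, tags))
    return samples


def abstract_entities(tokens: List[str], entity_types: Dict[str, str]) -> List[str]:
    """Replace tokens found in entity_types with their placeholder.

    entity_types maps a token to an assumed placeholder such as "<PER>",
    "<LOC>", "<TIME>" or "<ORG>"; how entities are detected is outside
    the scope of this sketch.
    """
    return [entity_types.get(tok, tok) for tok in tokens]
```

Calling `make_typo_samples("他高兴地笑了")` yields one corrupted sentence per alternative character in the group, each paired with a per-character tag sequence that a sequence labeler such as a BiLSTM-CRF could be trained on. Abstracting named entities before training reduces the number of distinct rare tokens the model has to see.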

Key words

BiLSTM-CRF / grouping strategy / grouped single-character confusion sets / typo recognition training corpus

Cite this article

CAO Yang, CAO Cungen, ZI Kangli, WANG Shi. Chinese Typos Recognition by Character Grouping and BiLSTM-CRF. Journal of Chinese Information Processing. 2023, 37(4): 156-165

Funding

Key Research and Development Project of the Ministry of Science and Technology (2017YFC1700302)