Construction of Testset for Tibetan Text Proofreading

CAI Zhijie, SAN Maocuo, CAIRANG Zhuoma

Journal of Chinese Information Processing ›› 2023, Vol. 37 ›› Issue (11): 15-22.
Information Processing for Ethnic, Cross-Border, and Neighboring Languages

Abstract

A text proofreading test set is the foundation of spell-checking research. Such test sets are of two kinds: traditional and standard. A traditional test set is obtained by manually injecting errors into correct data on the basis of subjective experience. A standard test set is obtained from highly reliable real data collected from selected study subjects. Building on an analysis of how English and Chinese text proofreading test sets are constructed, and taking the characteristics of Tibetan into account, this paper studies construction methods for Tibetan text proofreading test sets and builds a standard test set for evaluating Tibetan text proofreading performance. It also statistically analyzes the types and distribution of errors in the test set and verifies the validity and usability of the constructed standard test set.
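To make the distinction concrete, the following is a minimal sketch of how a traditional (artificially corrupted) test set might be produced and how its error-type distribution could be tallied. It is not the authors' procedure: the three error types, the function names `corrupt`, `inject_errors`, and `error_distribution`, and the sample syllables are all assumptions chosen for illustration, and the edits operate on raw Unicode code points rather than on Tibetan syllable structure.

```python
import random
from collections import Counter

# Hypothetical error types for illustration; not the paper's taxonomy.
ERROR_TYPES = ("delete", "insert", "substitute")


def corrupt(token, error_type, alphabet, rng):
    """Apply one code-point-level edit to `token` and return the result."""
    pos = rng.randrange(len(token))
    if error_type == "delete":
        return token[:pos] + token[pos + 1:]
    if error_type == "insert":
        return token[:pos] + rng.choice(alphabet) + token[pos:]
    # substitute: choose a replacement that differs from the original character
    choices = [c for c in alphabet if c != token[pos]]
    return token[:pos] + rng.choice(choices) + token[pos + 1:]


def inject_errors(tokens, error_rate, alphabet, seed=0):
    """Corrupt each token with probability `error_rate`.

    Returns the noisy tokens and a list of annotations
    (index, error_type, original, corrupted).
    """
    rng = random.Random(seed)
    noisy, annotations = [], []
    for i, tok in enumerate(tokens):
        if tok and rng.random() < error_rate:
            # Never delete the only character of a one-character token.
            kinds = ERROR_TYPES if len(tok) > 1 else ERROR_TYPES[1:]
            kind = rng.choice(kinds)
            bad = corrupt(tok, kind, alphabet, rng)
            noisy.append(bad)
            annotations.append((i, kind, tok, bad))
        else:
            noisy.append(tok)
    return noisy, annotations


def error_distribution(annotations):
    """Return the relative frequency of each error type."""
    counts = Counter(kind for _, kind, _, _ in annotations)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


# Sample Tibetan syllables (assumed example data, not from the paper).
syllables = ["བཀྲ", "ཤིས", "བདེ", "ལེགས"]
alphabet = sorted(set("".join(syllables)))
noisy, annotations = inject_errors(syllables, error_rate=0.5, alphabet=alphabet, seed=42)
print(noisy)
print(error_distribution(annotations))
```

Because the errors here are drawn uniformly at random, the resulting distribution reflects the sampler rather than real writers, which is the limitation of traditional test sets that motivates collecting a standard test set from real data.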

Key words

natural language processing / Tibetan / text proofreading / evaluation set

Cite this article

CAI Zhijie, SAN Maocuo, CAIRANG Zhuoma. Construction of Testset for Tibetan Text Proofreading. Journal of Chinese Information Processing. 2023, 37(11): 15-22

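Funding

National Natural Science Foundation of China (61866032, 61966031); Qinghai Provincial Department of Science and Technology funded project (2019-SF-129); Innovation Team funded project of the "Program for Changjiang Scholars and Innovative Research Team in University" (IRT1068); Qinghai Provincial Key Laboratory projects (2013-Z-Y17, 2014-Z-Y32, 2015-Z-Y03); Key Laboratory of Tibetan Information Processing and Machine Translation projects (2013-Y-17, 2020-ZJ-Y05)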