Review of the First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff

LI Bin, YUAN Yiguo, LU Jingya, FENG Minxuan, XU Chao, QU Weiguang, WANG Dongbo

Journal of Chinese Information Processing, 2023, Vol. 37, Issue 3: 46-53, 64.
Section: Language Resource Construction and Application

Abstract

The body of ancient Chinese texts is vast and calls for intelligent methods of automatic processing. Automatic word segmentation and part-of-speech (POS) tagging are the foundational tasks of ancient Chinese information processing, but the lack of large-scale lexicons and annotated corpora has slowed the development of automatic analysis techniques for ancient Chinese. This paper summarizes the First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff. The bakeoff provided a manually corrected, finely annotated corpus as unified training data and used the F1-score as the evaluation metric to compare ancient Chinese lexical analysis systems on two test sets, a basic test set and a blind test set. It also distinguished an open and a closed test mode according to whether external resources were used. The bakeoff was held at the Second Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA), a workshop of the 13th edition of the Language Resources and Evaluation Conference (LREC), and 14 teams participated. On the basic test set, the best F1-scores for word segmentation and POS tagging reached 96.16% and 92.05%, respectively, in the closed test, and 96.34% and 92.56%, respectively, in the open test. On the blind test set, the F1-scores for word segmentation and POS tagging reached 93.64% and 87.77%, respectively, in the closed test, and 95.03% and 89.47%, respectively, in the open test. Out-of-vocabulary words remain the bottleneck of ancient Chinese lexical analysis. The best systems in the bakeoff raised ancient Chinese lexical analysis to a new level, and deep learning and pre-trained models substantially improved the performance of automatic ancient Chinese processing.
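Because every result in the abstract is an F1-score, it may help to see how such a score can be computed for segmentation alone and for segmentation combined with POS tagging. The following is a minimal sketch, not the bakeoff's official scorer: it assumes a predicted word counts as correct only if its character span matches a gold word exactly (and, for POS tagging, its tag matches too), and that scores are micro-averaged over all sentences. The function names, the sample sentence, and the tag labels (`n`, `v`, `c`, `r`) are hypothetical illustrations.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def spans(tokens: List[Token], with_pos: bool) -> Set[tuple]:
    """Map a tagged sentence to a set of (start, end[, pos]) character spans."""
    out, start = set(), 0
    for word, pos in tokens:
        end = start + len(word)
        out.add((start, end, pos) if with_pos else (start, end))
        start = end
    return out


def f1_score(gold: List[List[Token]], pred: List[List[Token]],
             with_pos: bool = False) -> float:
    """Micro-averaged F1 over all sentences (assumed scoring scheme)."""
    n_gold = n_pred = n_hit = 0
    for g, p in zip(gold, pred):
        gs, ps = spans(g, with_pos), spans(p, with_pos)
        n_gold += len(gs)
        n_pred += len(ps)
        n_hit += len(gs & ps)
    if n_hit == 0:
        return 0.0
    precision, recall = n_hit / n_pred, n_hit / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold annotation and system output for one sentence.
gold = [[("子", "n"), ("曰", "v"), ("學", "v"), ("而", "c"),
         ("時", "n"), ("習", "v"), ("之", "r")]]
pred = [[("子", "n"), ("曰", "v"), ("學而", "v"),
         ("時", "v"), ("習", "v"), ("之", "r")]]

seg_f1 = f1_score(gold, pred)                 # segmentation only
pos_f1 = f1_score(gold, pred, with_pos=True)  # segmentation + POS tag
print(f"segmentation F1 = {seg_f1:.4f}, POS tagging F1 = {pos_f1:.4f}")
```

In this example the system merges two gold words into one and mistags another, so the POS tagging score is lower than the segmentation score. The same ordering appears in the reported results, which is expected under this scheme because a tag can only be counted as correct on a correctly segmented word.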

Key words

ancient Chinese / evaluation / word segmentation / POS tagging / ancient language information processing

Cite this article

LI Bin, YUAN Yiguo, LU Jingya, FENG Minxuan, XU Chao, QU Weiguang, WANG Dongbo. Review of the First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff. Journal of Chinese Information Processing. 2023, 37(3): 46-53, 64.

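Funding

National Social Science Fund of China (21ZD&331); Jiangsu Provincial Social Science Fund (20JYB004); State Language Commission Project (YB145-41); Key Research Project on Ancient Books Work (22GJK006)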