Language Resources Construction
LI Bin, YUAN Yiguo, LU Jingya, FENG Minxuan, XU Chao, QU Weiguang, WANG Dongbo
Journal of Chinese Information Processing.
2023, 37(3): 46-53, 64.
Automatic word segmentation and part-of-speech (POS) tagging are the basic tasks of ancient Chinese information processing. The lack of large-scale vocabularies and annotated corpora has slowed the development of ancient Chinese processing technology. This paper summarizes the First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff, which provided a manually annotated corpus as unified training data, together with a basic test set and a blind test set. The bakeoff also distinguished between open and closed test modes according to whether external resources were used. It was held at the Second Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA), in the context of the 13th Edition of the Language Resources and Evaluation Conference (LREC). A total of 14 teams participated. On the basic test set, the F1-scores of word segmentation and POS tagging reached 96.16% and 92.05%, respectively, in the closed test, and 96.34% and 92.56%, respectively, in the open test. On the blind test set, the F1-scores of word segmentation and POS tagging reached 93.64% and 87.77%, respectively, in the closed test, and 95.03% and 89.47%, respectively, in the open test. Out-of-vocabulary words remain a barrier to ancient Chinese lexical analysis, while deep learning and pre-trained models effectively improve the performance of automatic ancient Chinese processing.
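The F1-scores reported above can be illustrated with a minimal sketch of the span-matching evaluation commonly used for joint word segmentation and POS tagging. It is not the bakeoff's official scorer: the function names, the example sentence, and the tag labels are assumptions made for illustration. A predicted word counts as correct for segmentation when its character span matches a gold span, and as correct for POS tagging when the tag matches as well.

```python
def spans(tokens, with_tags=False):
    """Map (word, tag) pairs to a set of character spans, optionally with tags."""
    out, start = set(), 0
    for word, tag in tokens:
        end = start + len(word)
        out.add((start, end, tag) if with_tags else (start, end))
        start = end
    return out


def prf(gold, pred, with_tags=False):
    """Return (precision, recall, F1) for one sentence of (word, tag) pairs."""
    gold_spans = spans(gold, with_tags)
    pred_spans = spans(pred, with_tags)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example; the tags are illustrative, not the bakeoff's tag set.
gold = [("學", "v"), ("而", "c"), ("時", "n"), ("習", "v"), ("之", "r")]
pred = [("學", "v"), ("而", "c"), ("時習", "v"), ("之", "u")]

seg_p, seg_r, seg_f1 = prf(gold, pred)                   # segmentation only
pos_p, pos_r, pos_f1 = prf(gold, pred, with_tags=True)   # segmentation + tag
```

In this example three of the four predicted words match gold spans, so segmentation precision is 0.75 and recall is 0.6, and the wrong tag on the last word lowers the POS tagging scores further. A corpus-level score would accumulate the correct, predicted, and gold counts over all sentences before computing F1.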