Automatic Traditional Ancient Chinese Texts Segmentation and Punctuation Based on Pre-trained Language Model

TANG Xuemei1,2, SU Qi2,3,4, WANG Jun1,2,4, CHEN Yuhang1,2, YANG Hao2,4

Journal of Chinese Information Processing (中文信息学报) ›› 2023, Vol. 37 ›› Issue (8): 159-168.
Natural Language Understanding and Generation

Abstract

Ancient books that have not been critically edited contain no punctuation, which does not suit modern reading habits; adding sentence segmentation and punctuation to ancient texts facilitates reading, research, and publication. In this paper, we propose an automatic sentence segmentation and punctuation framework for traditional-script ancient Chinese texts based on a pre-trained language model. We compile a traditional-script ancient Chinese corpus of about one billion characters and use it to incrementally train BERT (Bidirectional Encoder Representations from Transformers). Experimental results show that the incrementally trained language model provides better semantic representations of ancient Chinese and improves both automatic sentence segmentation and automatic punctuation. With the incrementally trained model, the sentence segmentation F1 score reaches 95.03% and the punctuation F1 score reaches 80.18%, improvements of 1.83% and 2.21% over the same language model without incremental training. To address the inefficiency of existing document-level segmentation schemes, we refine the earlier serial sliding-window approach, improving segmentation efficiency to some extent, and further propose a new parallel sliding-window scheme that segments long texts efficiently and accurately.
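To make the parallel sliding-window idea concrete, the sketch below restores punctuation in a long, unpunctuated text by scoring all overlapping windows in a single batch with a character-level token-classification model and then keeping, for each character, the prediction from the window whose center lies closest to it. This is a minimal sketch, not the authors' implementation: the checkpoint name `my-ancient-chinese-bert`, the window and stride sizes, and the punctuation label set are illustrative assumptions.

```python
# Minimal sketch of parallel sliding-window punctuation (not the authors' code).
# Assumptions: a Hugging Face token-classification checkpoint fine-tuned to predict,
# for every character, the punctuation mark (if any) to insert after it.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "my-ancient-chinese-bert"  # hypothetical incrementally trained checkpoint
WINDOW, STRIDE = 256, 128               # overlapping windows of 256 chars, step 128

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()

# Assumed label inventory: 0 = no mark, otherwise the mark inserted after the character.
ID2PUNCT = {0: "", 1: "，", 2: "。", 3: "、", 4: "？", 5: "！", 6: "；", 7: "："}

def punctuate(text: str) -> str:
    """Punctuate a long unpunctuated text with one batched (parallel) forward pass."""
    if not text:
        return text

    # 1. Cut the character sequence into overlapping windows; make sure the tail is covered.
    starts = list(range(0, max(len(text) - WINDOW, 0) + 1, STRIDE))
    if starts[-1] + WINDOW < len(text):
        starts.append(len(text) - WINDOW)
    windows = [list(text[s:s + WINDOW]) for s in starts]  # one "word" per character

    # 2. Encode every window and run all of them through the model as a single batch,
    #    so windows are scored in parallel instead of one after another.
    enc = tokenizer(windows, is_split_into_words=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        pred = model(**enc).logits.argmax(dim=-1)  # shape: [n_windows, seq_len]

    # 3. Merge the window predictions: each character keeps the label coming from
    #    the window whose center is nearest, i.e. where it has the most context.
    labels = [0] * len(text)
    best_dist = [float("inf")] * len(text)
    for w_idx, start in enumerate(starts):
        for tok_idx, char_idx in enumerate(enc.word_ids(batch_index=w_idx)):
            if char_idx is None:        # [CLS], [SEP], padding
                continue
            pos = start + char_idx
            dist = abs(char_idx - WINDOW / 2)
            if dist < best_dist[pos]:
                best_dist[pos] = dist
                labels[pos] = int(pred[w_idx, tok_idx])

    # 4. Re-insert the predicted marks after their characters.
    return "".join(ch + ID2PUNCT.get(lab, "") for ch, lab in zip(text, labels))
```

On this reading, the batching step is what distinguishes the parallel scheme from a serial one, where each window must be finished before the next can be positioned; the overlap ensures that every character lies well inside at least one window rather than at a boundary.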

Key words

automatic texts segmentation / automatic punctuation / pre-trained language model

Cite this article

TANG Xuemei, SU Qi, WANG Jun, CHEN Yuhang, YANG Hao. Automatic Traditional Ancient Chinese Texts Segmentation and Punctuation Based on Pre-trained Language Model. Journal of Chinese Information Processing, 2023, 37(8): 159-168.

Funding

National Natural Science Foundation of China (72010107003)