Abstract
Ancient Chinese differs greatly from modern Chinese in vocabulary and grammar, and ancient texts usually contain no separators or punctuation between sentences, which makes them hard for today's readers to understand. Manual segmentation eases this difficulty, but it requires expertise in several fields and is time-consuming; automatic segmentation can speed up the accurate reading of ancient texts and thereby support the study of classical works and the transmission of Chinese culture. Besides automatic sentence segmentation, this paper also tackles automatic punctuation. We pre-train a BERT (Bidirectional Encoder Representations from Transformers) model on ancient Chinese texts ourselves and fine-tune it for each task, obtaining the current state-of-the-art results on both. The approach significantly outperforms traditional statistical methods and the mainstream BiLSTM+CRF sequence-labeling model, reaching F1 scores of 89.97% on a small single-category test set and 91.67% on a large multi-category test set. It also generalizes well: on a Taoist Canon (Daozang) test set that contributed no training data, the F1 score still reaches 88.76%. On the punctuation task, trained with only a small amount of relatively coarse punctuated text, the model achieves an F1 score of 70.40%, exceeding the BiLSTM+CRF baseline by 12.15%. The code and models have been released as open source.
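The released code is not part of this record, so the following is only a minimal sketch of the fine-tuning setup the abstract describes: sentence segmentation cast as character-level token classification on top of a pre-trained ancient-Chinese BERT. It uses the Hugging Face Transformers library; the checkpoint name `my-guwen-bert`, the binary O/B tag set, and the `segment` helper are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only (not the authors' released code): ancient-Chinese
# sentence segmentation framed as character-level token classification with
# a fine-tuned BERT. "my-guwen-bert" is a placeholder checkpoint name, and
# the binary O/B label scheme is an assumption made for this example.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

LABELS = ["O", "B"]  # "B": a sentence boundary follows this character

tokenizer = BertTokenizerFast.from_pretrained("my-guwen-bert")
model = BertForTokenClassification.from_pretrained(
    "my-guwen-bert", num_labels=len(LABELS)
)
model.eval()  # assumes the classification head has already been fine-tuned


def segment(text: str) -> str:
    """Insert a '/' after every character predicted to end a sentence."""
    # Feed the text character by character so labels align with characters.
    enc = tokenizer(
        list(text),
        is_split_into_words=True,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        logits = model(**enc).logits        # shape: (1, seq_len, num_labels)
    pred = logits.argmax(dim=-1)[0].tolist()

    out, prev_word = [], None
    for token_idx, word_idx in enumerate(enc.word_ids(0)):
        # Skip special tokens and sub-token repeats of the same character.
        if word_idx is None or word_idx == prev_word:
            continue
        prev_word = word_idx
        out.append(text[word_idx])
        if LABELS[pred[token_idx]] == "B":
            out.append("/")
    return "".join(out)


print(segment("學而時習之不亦說乎有朋自遠方來不亦樂乎"))
```

The punctuation task fits the same mold: replacing the binary tag set with one label per punctuation mark (plus "O") lets the model predict which mark, if any, should follow each character.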
Key words
automatic text segmentation /
automatic punctuation /
BERT /
fine-tuning