Automatic Ancient Chinese Text Segmentation Based on BERT
YU Jingsong1, WEI Yi1, ZHANG Yongwei2
1. School of Software and Microelectronics, Peking University, Beijing 100871, China; 2. Institute of Linguistics, Chinese Academy of Social Sciences, Beijing 100732, China
|
|
Abstract Ancient Chinese differs from modern Chinese in both vocabulary and grammar. Because ancient Chinese texts carry no explicit sentence boundaries or punctuation, they are hard for today's readers to understand, and segmenting them by hand is laborious work that demands expertise in a variety of fields. We investigate automatic sentence segmentation and punctuation of ancient Chinese texts based on recent deep learning technology. By pre-training a BERT (Bidirectional Encoder Representations from Transformers) model on ancient Chinese texts ourselves and then fine-tuning it, we obtain state-of-the-art results on both tasks. Compared with traditional statistical methods and the current BiLSTM+CRF solution, our approach performs significantly better, achieving F1-scores of 89.97% on a small single-category corpus and 91.67% on a large multi-category corpus. Notably, it also generalizes well, reaching an F1-score of 88.76% on a completely new Taoist corpus. On the punctuation task, our method achieves an F1-score of 70.40%, exceeding the BiLSTM+CRF baseline by 12.15%.
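The abstract casts segmentation and punctuation as character-level sequence labeling over a fine-tuned BERT encoder. The sketch below illustrates that setup; it is an assumption rather than the authors' published code: it uses the Hugging Face transformers library for illustration, a hypothetical two-label boundary scheme, and "bert-base-chinese" as a stand-in for the authors' own checkpoint pre-trained on ancient Chinese texts.

import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Hypothetical label scheme: 0 = no boundary after this character,
# 1 = a sentence boundary (punctuation mark) follows this character.
NUM_LABELS = 2

# "bert-base-chinese" stands in for the authors' ancient-Chinese checkpoint.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=NUM_LABELS
)
model.eval()

text = "天地玄黄宇宙洪荒"  # unpunctuated ancient Chinese input
# Feed characters individually so tokens align one-to-one with characters.
enc = tokenizer(list(text), is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, seq_len, NUM_LABELS)
pred = logits.argmax(-1)[0]

# Reconstruct a segmented string; special tokens [CLS]/[SEP] map to
# word_id None and are skipped.
out = []
for idx, wid in enumerate(enc.word_ids(0)):
    if wid is None:
        continue
    out.append(text[wid])
    if pred[idx].item() == 1:
        out.append("。")  # insert a boundary mark after this character
print("".join(out))

Fine-tuning would minimize the standard token-classification cross-entropy against boundary labels derived from punctuated editions; until the head is trained on such data, its predictions are arbitrary.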
|
Received: 15 May 2019
|
|
|
|
|
|
|
|