Knowledge Representation and Sentence Segmentation of Ancient Chinese Based on Deep Language Models

HU Renfen1,2, LI Shen1, ZHU Yuchen3

Journal of Chinese Information Processing ›› 2021, Vol. 35 ›› Issue (4): 8-15.
Section: Knowledge Representation and Knowledge Acquisition

Abstract

Sentence segmentation (judou) of ancient Chinese texts is difficult even for experts: it relies not only on the meaning and context of the current text, but also on historical and cultural knowledge. This paper proposes a knowledge representation method for ancient Chinese based on a deep language model (BERT), and on top of it builds a high-accuracy automatic sentence segmentation model using a Conditional Random Field (CRF) and a Convolutional Neural Network (CNN). Across three text styles, the model achieves F1 scores above 99% on poems, 95% on lyrics (ci), and 92% on prose. On lyrics and prose, whose phrasing is more flexible, it outperforms a traditional bidirectional recurrent neural network (Bi-GRU) baseline by more than 10% in F1. Experimental results show that the model captures the rhythm and meter of poems and lyrics well, and makes full use of contextual information, encoding word order, syntax, semantics, and context. In further case studies, the method also performs well on difficult segmentation errors found in published ancient books.
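The task described above is naturally cast as character-level sequence labeling: punctuation is stripped from a punctuated corpus, and each remaining character is tagged according to whether a sentence break follows it. The model then predicts these tags on unpunctuated text. A minimal data-preparation sketch follows; the tag names ("S"/"O") and the break-punctuation set are illustrative assumptions, not the paper's exact scheme:

```python
# Cast ancient-Chinese sentence segmentation as character tagging:
# a character gets tag "S" if a sentence break follows it, else "O".
# The break-marking punctuation set is an illustrative assumption.
BREAKS = set("，。、；：？！")

def to_tagged(text: str):
    """Strip punctuation; return (characters, tags) for training."""
    chars, tags = [], []
    for ch in text:
        if ch in BREAKS:
            if tags:                # mark the preceding character as a break point
                tags[-1] = "S"
        else:
            chars.append(ch)
            tags.append("O")
    return chars, tags

def to_text(chars, tags, mark="/"):
    """Re-insert break marks from predicted tags."""
    out = []
    for ch, tag in zip(chars, tags):
        out.append(ch)
        if tag == "S":
            out.append(mark)
    return "".join(out)

chars, tags = to_tagged("床前明月光，疑是地上霜。")
# chars → list of 10 characters, tags → ["O","O","O","O","S","O","O","O","O","S"]
```

At inference time, a BERT encoder with a CRF (or CNN) output layer would emit one tag per character, and `to_text` recovers the segmented text.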

Key words

ancient Chinese / automatic sentence segmentation / deep language model

Cite this article

HU Renfen, LI Shen, ZHU Yuchen. Knowledge Representation and Sentence Segmentation of Ancient Chinese Based on Deep Language Models. Journal of Chinese Information Processing, 2021, 35(4): 8-15.


Funding

National Natural Science Foundation of China (62006021); Humanities and Social Sciences Youth Fund of the Ministry of Education (18YJC751073); National Social Science Fund of China (18ZDA238)