Word Segmentation for Ancient Chinese Texts Based on Nonparametric Bayesian Models and Deep Learning

YU Jingsong, WEI Yi, ZHANG Yongwei, YANG Hao

Journal of Chinese Information Processing, 2020, Vol. 34, Issue (6): 1-8.
Language Analysis and Computation

Word Segmentation for Ancient Chinese Texts Based on Nonparametric Bayesian Models and Deep Learning

  • YU Jingsong1, WEI Yi1, ZHANG Yongwei2, YANG Hao3

Abstract

In ancient Chinese texts, characters are written continuously, without explicit segmentation marks between words, which poses great challenges to text understanding and even cultural inheritance. Automatic word segmentation is a fundamental task in natural language processing, but mainstream approaches require large amounts of manually segmented training data, which is costly to produce and especially scarce for ancient Chinese. To segment ancient Chinese texts, we propose Multi-Stage Iterative Training (MSIT), an unsupervised word segmentation method combining nonparametric Bayesian models with BERT (Bidirectional Encoder Representations from Transformers). It achieves an F1 score of 93.28% on the Zuozhuan (an ancient Chinese chronicle) dataset. After adding only 500 ground-truth sentences, which can be regarded as weakly supervised learning, the F1 score reaches 95.55%, surpassing the previous best result trained on 6/7 of the Zuozhuan dataset (about 36,000 ground-truth sentences). With the same training set, our method achieves an F1 score of 97.40%, the state-of-the-art result. Experiments further show that the method not only outperforms traditional sequence labeling models, including BERT, but also generalizes better. The model and related code are available online.
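The nonparametric Bayesian side of the approach can be illustrated with a minimal sketch: a Dirichlet-process unigram word model sampled with boundary-wise Gibbs steps, in the spirit of Goldwater et al. (a line of work this paper builds on). This is an illustrative toy, not the authors' released code; all class and function names (`DPSegmenter`, `gibbs_sample`) are hypothetical, and the split/merge probabilities ignore the dependence between the two new words for brevity.

```python
import random
from collections import Counter

class DPSegmenter:
    """Toy Dirichlet-process unigram word model (Chinese-restaurant-process counts)."""

    def __init__(self, alpha=1.0, p_char=None):
        self.alpha = alpha          # DP concentration parameter
        self.counts = Counter()     # word -> times generated so far
        self.total = 0
        self.p_char = p_char or {}  # character unigram base distribution

    def p0(self, word):
        # Base measure: character unigrams times a geometric length prior.
        p = 0.5 ** len(word)
        for ch in word:
            p *= self.p_char.get(ch, 1e-6)
        return p

    def prob(self, word):
        # CRP predictive probability of generating `word` next.
        return (self.counts[word] + self.alpha * self.p0(word)) / (self.total + self.alpha)

    def add(self, word):
        self.counts[word] += 1
        self.total += 1

    def remove(self, word):
        self.counts[word] -= 1
        self.total -= 1

def gibbs_sample(sentences, iters=50, seed=0):
    """Resample each candidate word boundary given all other boundaries."""
    rng = random.Random(seed)
    chars = Counter(c for s in sentences for c in s)
    n = sum(chars.values())
    model = DPSegmenter(p_char={c: k / n for c, k in chars.items()})
    # bounds[i][j] is True iff there is a word boundary after character j of sentence i.
    bounds = [[rng.random() < 0.5 for _ in range(len(s) - 1)] for s in sentences]

    def words(s, b):
        out, start = [], 0
        for j, cut in enumerate(b):
            if cut:
                out.append(s[start:j + 1])
                start = j + 1
        out.append(s[start:])
        return out

    for s, b in zip(sentences, bounds):        # seed the model with the initial segmentation
        for w in words(s, b):
            model.add(w)

    for _ in range(iters):
        for i, s in enumerate(sentences):
            for j in range(len(s) - 1):
                # Extend left/right to the words currently surrounding boundary j.
                left = j
                while left > 0 and not bounds[i][left - 1]:
                    left -= 1
                right = j + 1
                while right < len(s) - 1 and not bounds[i][right]:
                    right += 1
                w_left, w_right = s[left:j + 1], s[j + 1:right + 1]
                w_merged = s[left:right + 1]
                # Remove the words implied by the current boundary value, then resample it.
                if bounds[i][j]:
                    model.remove(w_left)
                    model.remove(w_right)
                else:
                    model.remove(w_merged)
                p_split = model.prob(w_left) * model.prob(w_right)
                p_merge = model.prob(w_merged)
                bounds[i][j] = rng.random() < p_split / (p_split + p_merge)
                if bounds[i][j]:
                    model.add(w_left)
                    model.add(w_right)
                else:
                    model.add(w_merged)
    return [words(s, b) for s, b in zip(sentences, bounds)]
```

In the paper's pipeline, the output of a sampler like this would serve as silver-standard training data for a BERT-based sequence labeler, whose predictions in turn refine the next training stage (the multi-stage iteration); that coupling is not shown here.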

Key words

word segmentation for ancient Chinese texts / nonparametric Bayesian models / deep learning / unsupervised learning / weakly supervised learning

Cite this article

YU Jingsong, WEI Yi, ZHANG Yongwei, YANG Hao. Word Segmentation for Ancient Chinese Texts Based on Nonparametric Bayesian Models and Deep Learning. Journal of Chinese Information Processing, 2020, 34(6): 1-8.


Funding

National Natural Science Foundation of China (61876004)