基于感知器的中文分词增量训练方法研究

韩 冰,刘一佳,车万翔,刘 挺

PDF(1548 KB)
PDF(1548 KB)
中文信息学报 ›› 2015, Vol. 29 ›› Issue (5) : 49-55.
词法与分词

基于感知器的中文分词增量训练方法研究

  • 韩 冰,刘一佳,车万翔,刘 挺
作者信息 +

An Incremental Learning Scheme for Perceptron Based Chinese Word Segmentation

  • HAN Bing, LIU Yijia, CHE Wanxiang, LIU Ting
Author information +
History +

摘要

该文提出了一种基于感知器的中文分词增量训练方法。该方法可在训练好的模型基础上添加目标领域标注数据继续训练,解决了大规模切分数据难于共享,源领域与目标领域数据混合需要重新训练等问题。实验表明,增量训练可以有效提升领域适应性,达到与传统数据混合相类似的效果。同时该文方法模型占用空间小,训练时间短,可以快速训练获得目标领域的模型。

Abstract

In this paper, we propose an incremental learning scheme for perceptron based Chinese word segmentation. Our method can perform continuous training over a fine tuned source domain model, enabling to deliver model without annotated data and re-training. Experimental results shows the scheme proposed can significantly improve adaptation performance on Chinese word segmentation and achieve comparable performance with traditional method. At the same time, our method can significantly reduce the model size and the training time.

关键词

中文分词 / 领域适应 / 增量训练

Key words

Chinese word segmentation / domain adaptation / incremental learning

引用本文

导出引用
韩 冰,刘一佳,车万翔,刘 挺. 基于感知器的中文分词增量训练方法研究. 中文信息学报. 2015, 29(5): 49-55
HAN Bing, LIU Yijia, CHE Wanxiang, LIU Ting. An Incremental Learning Scheme for Perceptron Based Chinese Word Segmentation. Journal of Chinese Information Processing. 2015, 29(5): 49-55

参考文献

[1] XUE N, SHEN L. Chinese word segmentation as LMR tagging[C]//Proceedings of the second SIGHAN workshop on Chinese language processing. 2003, 17: 176-179.
[2] ZHANG Y, CLARK S. Chinese Segmentation with a Word-Based Perceptron Algorithm[C]//Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. 2007: 840-847.
[3] SHI Y, WANG M. A dual-layer CRFs based joint decoding method for cascaded segmentation and labeling tasks[C]//Proceedings of IJCAI. 2007, 7: 1707-1712.
[4] SUN W. Word-based and Character-based Word Segmentation Models: Comparison and Combination[C]//Proceedings of the COLING 2010: Posters. 2010: 1211-1219.
[5] ZHANG M, ZHANG Y, CHE W,et al. Type-Supervised Domain Adaptation for Joint Segmentation and POS-Tagging[C]//Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. 2014: 588-597.
[6] LIU Y, ZHANG Y. Unsupervised Domain Adaptation for Joint Segmentation and POS-Tagging[C]//Proceedings of COLING 2012: Posters. 2012: 745-754.
[7] LIU Y, ZHANG Y, CHE W, et al. Domain Adaptation for CRF-based Chinese Word Segmentation using Free Annotations[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 864-874.
[8] LIU Y, ZHANG M, CHE W, et al. Micro blogs Oriented Word Segmentation System[C]//Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing. 2012: 85-89.
[9] XUE N. Chinese word segmentation as character tagging[J]. Computational Linguistics and Chinese Language Processing, 2003, 8(1): 29-48.
[10] COLLINS M. Discriminative Training Methods for Hidden Markov Models: Theory and experiments with perceptron algorithms[C]//Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. 2002: 1-8.
[11] SUN W, XU J. Enhancing Chinese word segmentation using unlabeled data[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2011: 970-979.
[12] 张梅山, 邓知龙, 车万翔,等. 统计与词典相结合的领域自适应中文分词[J]. 中文信息学报, 2010, 26(2): 8-12.
PDF(1548 KB)

586

Accesses

0

Citation

Detail

段落导航
相关文章

/