CHENG Yusi, SHI Yuntao. Domain Adaption of Chinese Word Segmentation Based on Deep Learning and Transfer Learning[J]. Journal of Chinese Information Processing, 2019, 33(9): 9-16, 23.
Abstract: To improve the performance of Chinese word segmentation in a specific domain, a domain adaptation method for word segmentation is proposed, based on deep learning and transfer learning. First, a bidirectional long short-term memory network with a CRF layer (BI-LSTM-CRF) that takes a dictionary feature as additional input is constructed for Chinese word segmentation and trained on a general-domain corpus to obtain the model parameters. Second, the parameters of this general-domain BI-LSTM-CRF model are fine-tuned on a small training corpus from the construction law domain, with domain dictionary information added to the dictionary feature. The experimental results show that transfer learning reduces the number of epochs needed for optimization. Compared with the BI-LSTM-CRF model trained only on the general domain, the proposed model increases F1 by 7.02% in the construction law domain. Compared with the BI-LSTM-CRF model that uses a domain dictionary only in the prediction process, the proposed model increases F1 by 4.22%.
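The abstract reports its gains as F1 but does not say how the score is computed. The sketch below assumes the usual word-level convention for segmentation: a predicted word counts as correct only if both of its boundaries match a gold word. The function names and the example sentence are hypothetical illustrations, not taken from the paper.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F1) for one segmented sentence.

    Assumption: word-level span matching, i.e. a predicted word is
    correct only when its start and end offsets both match a gold word.
    """
    gold_spans = to_spans(gold_words)
    pred_spans = to_spans(pred_words)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction merges the first two gold words.
gold = ["建设", "工程", "施工", "合同"]
pred = ["建设工程", "施工", "合同"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Here two of the three predicted words match gold words, so precision is 2/3, recall is 2/4, and F1 is 4/7. Corpus-level scores are normally obtained by summing the counts over all sentences before dividing, which this single-sentence sketch does not do.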