Active Learning Based Domain Adaptation for Chinese Word Segmentation

XU Huating, ZHANG Yujie, YANG Xiaohui, SHAN Hua, XU Jinan, CHEN Yufeng

Journal of Chinese Information Processing, 2015, Vol. 29, Issue 5: 55-63.
Section: Lexical Analysis and Word Segmentation

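Abstract (translated from the Chinese)

Chinese word segmentation systems trained on annotated news-domain corpora suffer a marked drop in performance when applied to other domains. Because large-scale annotated corpora for a target domain are hard to obtain, this paper proposes a domain adaptation method that combines an active learning algorithm with n-gram statistical features. The method statistically analyzes the differences between target-domain text and the existing annotated corpus, selects a small corpus containing the largest number of not-yet-annotated language phenomena for priority manual annotation, and then trains a target-domain segmentation system on that data together with n-gram statistical features drawn from large-scale text. The paper uses a CRF model for training and validates the method on one million sentences from the scientific and technical literature domain; the evaluation data are 300 manually annotated sentences of scientific and technical literature. Experimental results show that, on this test corpus, the segmentation system trained with active learning improves on every evaluation metric.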

Abstract

Chinese word segmentation systems trained on annotated newspaper corpora drop in performance when faced with a new domain. Since no large-scale annotated corpus exists for the target domain, this paper describes a domain adaptation method for Chinese word segmentation based on active learning. The idea is to select a small amount of data for annotation to bridge the gap between the target domain and the news domain. The word segmentation model is then re-trained by including the newly annotated data. We use a CRF model for training and a raw corpus of one million sentences of patent descriptions as the target domain. For test data, 300 sentences are randomly selected and manually annotated. The experimental results show that the Chinese word segmentation system based on our approach improves on every evaluation metric.
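
The selection step described in the abstract (choosing a small set of target-domain sentences for manual annotation) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it treats character bigrams as the "language phenomena", uses a simple greedy strategy, and the names `char_ngrams` and `select_for_annotation` are hypothetical. CRF training and the n-gram statistical features are outside the scope of this sketch.

```python
from typing import Iterable, List, Set


def char_ngrams(sentence: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams in a sentence."""
    return {sentence[i:i + n] for i in range(len(sentence) - n + 1)}


def select_for_annotation(annotated: Iterable[str],
                          target_pool: List[str],
                          budget: int,
                          n: int = 2) -> List[str]:
    """Greedily pick up to `budget` sentences from the target-domain pool.

    Assumption: a sentence is more worth annotating the more character
    n-grams it contains that are absent from the already annotated text.
    """
    # n-grams already covered by the existing annotated corpus (raw text).
    seen: Set[str] = set()
    for sent in annotated:
        seen |= char_ngrams(sent, n)

    remaining = list(target_pool)
    selected: List[str] = []
    while remaining and len(selected) < budget:
        # Sentence contributing the most n-grams not yet covered.
        best = max(remaining, key=lambda s: len(char_ngrams(s, n) - seen))
        if not char_ngrams(best, n) - seen:
            break  # no remaining sentence adds anything new
        selected.append(best)
        seen |= char_ngrams(best, n)  # treat its n-grams as covered
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    news = ["今天天气很好"]
    pool = ["今天天气很好", "本发明涉及一种装置", "天气很好"]
    print(select_for_annotation(news, pool, budget=2))
```

Updating the covered set after each pick keeps the sketch from selecting several sentences that all contain the same unseen n-grams, so a small annotation budget is spread over more distinct phenomena.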

Key words

Chinese word segmentation / domain adaptation / active learning

Cite this article

XU Huating, ZHANG Yujie, YANG Xiaohui, SHAN Hua, XU Jinan, CHEN Yufeng. Active Learning Based Domain Adaptation for Chinese Word Segmentation. Journal of Chinese Information Processing. 2015, 29(5): 55-63

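XU Huating (born 1991), assistant laboratory technician; main research area: natural language processing.
E-mail: xuhuating91@163.com
ZHANG Yujie (born 1961), corresponding author, professor; main research area: natural language processing.
E-mail: yjzhang@bjtu.edu.cn
YANG Xiaohui (born 1962), associate professor; main research area: computer applications.
E-mail: xhyang@bjtu.edu.cn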
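Funding

National International Science and Technology Cooperation Special Program (2014DFA11350); National Natural Science Foundation of China (61370130)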