Character-level Dependency Model for Joint Word Segmentation, POS Tagging, and Dependency Parsing in Chinese

GUO Zhen, ZHANG Yujie, SU Chen, XU Jinan

Journal of Chinese Information Processing, 2014, Vol. 28, Issue 6: 1-8.
Lexical, Syntactic, and Semantic Analysis and Applications

Abstract

Current transition-based joint models for Chinese word segmentation, POS tagging, and dependency parsing have two main problems. First, the way the tasks are combined leaves room for improvement: character-based word segmentation and word-based dependency parsing are not well integrated in the transition-based framework. Second, model performance is limited by the size of the fully annotated corpus. To address the first problem, we use the internal structure of words to extend the conventional word-based dependency tree into a character-based dependency tree and, with a transition-based strategy, build a character-level joint model for the three tasks. Following the sequence-labeling approach to Chinese word segmentation, we redesign transition-based segmentation as four transition actions, Shift_S, Shift_B, Shift_M, and Shift_E, which also allow features from previous word segmentation research to be integrated into the joint model. To address the second problem, we extract string-level n-gram features and structure-level dependency subtree features from a partially annotated corpus and integrate them into the joint model, yielding a semi-supervised joint model. Experimental results on the Penn Chinese Treebank show that our model achieves F1 scores of 98.31%, 94.84%, and 81.71% for word segmentation, POS tagging, and dependency parsing, respectively, improvements of 0.92%, 1.77%, and 3.95% over the single-task pipeline models. The word segmentation and POS tagging results are the best among published results to date.
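The relation between a segmented sentence and the four segmentation actions named in the abstract can be illustrated with a small sketch. It assumes the usual sequence-labeling reading of the suffixes (S = single-character word, B = beginning, M = middle, E = end of a multi-character word). The abstract does not spell out these semantics, and the function names `words_to_actions` and `actions_to_words` are hypothetical helpers, not part of the authors' system. The POS tagging and dependency actions of the joint model are not modeled here.

```python
from typing import List

# Assumed meaning of the four actions (standard S/B/M/E tagging scheme):
#   Shift_S: shift a character that forms a single-character word
#   Shift_B: shift the first character of a multi-character word
#   Shift_M: shift a middle character of a multi-character word
#   Shift_E: shift the last character of a multi-character word
VALID_ACTIONS = {"Shift_S", "Shift_B", "Shift_M", "Shift_E"}


def words_to_actions(words: List[str]) -> List[str]:
    """Derive one Shift_* action per character from a gold segmentation."""
    actions: List[str] = []
    for word in words:
        if not word:
            raise ValueError("empty word in segmentation")
        if len(word) == 1:
            actions.append("Shift_S")
        else:
            actions.append("Shift_B")
            actions.extend(["Shift_M"] * (len(word) - 2))
            actions.append("Shift_E")
    return actions


def actions_to_words(chars: str, actions: List[str]) -> List[str]:
    """Rebuild the segmentation implied by a per-character action sequence."""
    if len(chars) != len(actions):
        raise ValueError("need exactly one action per character")
    words: List[str] = []
    current = ""
    for ch, act in zip(chars, actions):
        if act not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {act}")
        if act in ("Shift_S", "Shift_B"):
            if current:
                raise ValueError("new word started before previous one ended")
            current = ch
        else:  # Shift_M or Shift_E continues an open word
            if not current:
                raise ValueError("continuation action without an open word")
            current += ch
        if act in ("Shift_S", "Shift_E"):
            words.append(current)
            current = ""
    if current:
        raise ValueError("last word was never closed")
    return words
```

Under these assumptions, the segmentation `["中文", "分词", "很", "有意思"]` maps to the action sequence B, E, B, E, S, B, M, E, and applying `actions_to_words` to the unsegmented string with that sequence recovers the same four words.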

Key words

joint model / Chinese word segmentation and POS tagging / dependency parsing / word internal dependency structure / semi-supervised learning

Cite this article

GUO Zhen, ZHANG Yujie, SU Chen, XU Jinan. Character-level Dependency Model for Joint Word Segmentation, POS Tagging, and Dependency Parsing in Chinese. Journal of Chinese Information Processing. 2014, 28(6): 1-8


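Funding

International Science and Technology Cooperation Program of China (2014DFA11350); National Natural Science Foundation of China (61370130); Beijing Jiaotong University Talent Fund (KKRC11001532)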