To build a simple and easily extensible Chinese syntactic parser, we manually constructed a 30,000-sentence Chinese treebank of binary-branching structures, following the hierarchical-analysis theory of Chinese syntax proposed by Zhu Dexi and Lu Jianming, and used a Huffman-style coding to compactly represent the hierarchical structure of each full binary tree. This paper converts Chinese syntactic parsing into an iterative binary-splitting sequence labeling problem and, based on the characteristics of this task, proposes a sequence labeling model that tags the intervals between words (RNN-Interval, RNN-INT). We compare it experimentally with common recurrent neural network models (RNN, LSTM) and the conditional random field model (CRF), using the m×2 cross-validated sequential t-test for model comparison. The results show that RNN-INT achieves its best performance with word features of window size 1, outperforming both other window sizes and the other sequence labeling models (RNN, LSTM, CRF). Finally, on the test set with gold word segmentation, RNN-INT reaches a phrase-level F1 (chunk F1) of 71.25% and a sentence-level accuracy of about 43%.
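The Huffman-style coding of a full binary tree mentioned above can be pictured as addressing each node by the 0/1 path from the root. The details of the treebank's actual encoding are not given here, so the following is only an illustrative sketch of the general idea:

```python
# Illustrative sketch (assumption: the treebank's exact encoding scheme is
# not specified in this abstract): represent a full binary tree's hierarchy
# by 0/1 path codes in the spirit of Huffman coding. Trees are nested
# 2-tuples; leaves are words.

def leaf_codes(tree, prefix=""):
    """Return {leaf: code}, where code is the left(0)/right(1) root-to-leaf path."""
    if not isinstance(tree, tuple):           # reached a leaf (a word)
        return {tree: prefix or "0"}          # single-word tree gets code "0"
    left, right = tree
    codes = leaf_codes(left, prefix + "0")    # descend left with bit 0
    codes.update(leaf_codes(right, prefix + "1"))  # descend right with bit 1
    return codes

tree = ("我", ("喜欢", ("自然", "语言")))
print(leaf_codes(tree))
# {'我': '0', '喜欢': '10', '自然': '110', '语言': '111'}
```

Because every internal node of a full binary tree has exactly two children, these path codes are prefix-free and recover the tree's layer structure unambiguously.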
Abstract
We construct a 30,000-sentence binary-branching Chinese treebank based on the Chinese syntactic theory proposed by Zhu Dexi and Lu Jianming, in which each parse is a full binary tree represented by Huffman-style coding for simplicity. To parse these structures, we propose a sequence labeling model (RNN-Interval, abbreviated RNN-INT) based on an RNN (recurrent neural network) that tags the intervals between words. We compare RNN-INT with plain RNN, LSTM, and CRF models using the m×2 cross-validated sequential t-test. The experimental results show that the proposed model achieves its best performance with window size 1, reaching a constituency F1 of 71.25% and a sentence accuracy of about 43%.
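The iterative binary-splitting formulation can be sketched as follows. This is not the authors' code; `oracle` is a hypothetical stand-in for the RNN-INT tagger, which in the paper labels the intervals between words to decide where each span splits:

```python
# Illustrative sketch (not the authors' implementation): parsing as
# iterative binary splitting driven by interval decisions. The `oracle`
# callable emulates the interval-tagging model: given a span of words,
# it returns the index i (1 <= i < len(span)) of the interval to split at.

def split_span(words, oracle):
    """Recursively binarize `words` into a nested-tuple full binary tree."""
    if len(words) == 1:
        return words[0]                      # a single word is a leaf
    i = oracle(tuple(words))                 # where the tagger says to split
    return (split_span(words[:i], oracle),   # left sub-span
            split_span(words[i:], oracle))   # right sub-span

# Toy oracle: always split after the first word, giving a right-branching tree.
right_branching = lambda span: 1

tree = split_span(["我", "喜欢", "自然", "语言"], right_branching)
print(tree)
# ('我', ('喜欢', ('自然', '语言')))
```

Replacing the toy oracle with a learned interval tagger turns this recursion into the iterative binary-split parsing procedure the abstract describes.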
Keywords
hierarchical syntactic parsing /
RNN (recurrent neural network) /
m×2 cross-validated sequential t-test
Funding
National Social Science Fund of China (16BTJ34)