To build a simple and easily extensible Chinese syntactic parser, we manually constructed a 30,000-sentence Chinese treebank of binary-branching structures, following the hierarchical-analysis theory of Chinese syntax proposed by Zhu Dexi and Lu Jianming, and used a Huffman-style coding to compactly represent the hierarchical structure of each full binary tree. This paper converts Chinese syntactic parsing into an iterative binary-splitting sequence labeling problem and, based on the characteristics of this task, proposes a sequence labeling model that tags the intervals between words (RNN-Interval, RNN-INT). We compare it experimentally with common recurrent neural network models (RNN, LSTM) and the conditional random field model (CRF), using the m×2 cross-validated sequential t-test for model comparison. The results show that RNN-INT achieves its best performance with word features of window size 1, outperforming both other window sizes and the other sequence labeling models (RNN, LSTM, CRF). Finally, on the test set with gold word segmentation, RNN-INT reaches a phrase-level F1 (chunk F1) of 71.25% and a sentence-level accuracy of about 43%.
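The Huffman-style coding of a full binary tree mentioned above can be pictured as addressing each node by the 0/1 path from the root. The details of the treebank's actual encoding are not given here, so the following is only an illustrative sketch of the general idea:

```python
# Illustrative sketch (assumption: the treebank's exact encoding scheme is
# not specified in this abstract): represent a full binary tree's hierarchy
# by 0/1 path codes in the spirit of Huffman coding. Trees are nested
# 2-tuples; leaves are words.

def leaf_codes(tree, prefix=""):
    """Return {leaf: code}, where code is the left(0)/right(1) root-to-leaf path."""
    if not isinstance(tree, tuple):           # reached a leaf (a word)
        return {tree: prefix or "0"}          # single-word tree gets code "0"
    left, right = tree
    codes = leaf_codes(left, prefix + "0")    # descend left with bit 0
    codes.update(leaf_codes(right, prefix + "1"))  # descend right with bit 1
    return codes

tree = ("我", ("喜欢", ("自然", "语言")))
print(leaf_codes(tree))
# {'我': '0', '喜欢': '10', '自然': '110', '语言': '111'}
```

Because every internal node of a full binary tree has exactly two children, these path codes are prefix-free and recover the tree's layer structure unambiguously.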
Abstract
We construct a 30,000-sentence binary-branching Chinese treebank based on the Chinese syntactic theory proposed by Zhu Dexi and Lu Jianming, in which each parse is a full binary tree represented by Huffman-style coding for simplicity. To parse these structures, we propose a sequence labeling model (RNN-Interval, abbreviated RNN-INT) based on an RNN (recurrent neural network) that tags the intervals between words. We compare RNN-INT with plain RNN, LSTM, and CRF models using the m×2 cross-validated sequential t-test. The experimental results show that the proposed model achieves its best performance with window size 1, reaching a constituency F1 of 71.25% and a sentence accuracy of about 43%.
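The iterative binary-splitting formulation can be sketched as follows. This is not the authors' code; `oracle` is a hypothetical stand-in for the RNN-INT tagger, which in the paper labels the intervals between words to decide where each span splits:

```python
# Illustrative sketch (not the authors' implementation): parsing as
# iterative binary splitting driven by interval decisions. The `oracle`
# callable emulates the interval-tagging model: given a span of words,
# it returns the index i (1 <= i < len(span)) of the interval to split at.

def split_span(words, oracle):
    """Recursively binarize `words` into a nested-tuple full binary tree."""
    if len(words) == 1:
        return words[0]                      # a single word is a leaf
    i = oracle(tuple(words))                 # where the tagger says to split
    return (split_span(words[:i], oracle),   # left sub-span
            split_span(words[i:], oracle))   # right sub-span

# Toy oracle: always split after the first word, giving a right-branching tree.
right_branching = lambda span: 1

tree = split_span(["我", "喜欢", "自然", "语言"], right_branching)
print(tree)
# ('我', ('喜欢', ('自然', '语言')))
```

Replacing the toy oracle with a learned interval tagger turns this recursion into the iterative binary-split parsing procedure the abstract describes.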
Keywords
hierarchical syntactic parsing /
RNN (recurrent neural network) /
m×2 cross-validated sequential t-test
Funding
National Social Science Fund of China (16BTJ34)