Chinese Syntactic Parsing Algorithm Based on Segmentation of Punctuation

MAO Qi, LIAN Le-xin, ZHOU Wen-cui, YUAN Chun-feng

Journal of Chinese Information Processing, 2007, Vol. 21, Issue 2: 29-34.
Review

Abstract

Most current syntactic parsers either ignore punctuation, an important syntactic feature, or handle it only in a very simple way. Based on the syntactic properties of punctuation, this paper introduces the concept of a separate parsing phrase, a block of the sentence that can be parsed on its own. Using the characteristic features and positions of punctuation marks within a sentence, it presents a method for identifying separate parsing phrases with the ID3 decision tree algorithm, thereby integrating punctuation into Chinese syntactic parsing. All experimental data, both the training set and the test set, come from the Chinese Penn Treebank 5.0. In a separate experiment on long Chinese sentences of more than 40 words, parsing precision and recall improved by 1.59% and 0.93% respectively, while the time cost fell by nearly two-thirds. The results show that punctuation is highly beneficial for parsing long Chinese sentences and that system performance improved substantially.
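The two ideas named in the abstract, cutting a sentence at punctuation and choosing decision-tree features the way ID3 does, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the punctuation inventory or the features used, so the `SPLIT_PUNCT` set, the feature names `punct` and `has_verb`, and the toy labels below are all hypothetical.

```python
from collections import Counter
from math import log2

# Assumption: an illustrative punctuation set, not the paper's inventory.
SPLIT_PUNCT = {",", "。", ";", ":", "!", "?"}


def split_at_punctuation(tokens):
    """Cut a tokenized sentence into candidate blocks, each ending at a punctuation token."""
    blocks, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SPLIT_PUNCT:
            blocks.append(current)
            current = []
    if current:  # trailing tokens with no closing punctuation
        blocks.append(current)
    return blocks


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """ID3 split criterion: entropy reduction obtained by partitioning on one feature."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Pick the feature with the highest information gain, as ID3 does at each node."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Toy data (hypothetical): one row per candidate block, labeled True when the
# block is treated as separately parsable.
rows = [
    {"punct": ",", "has_verb": True},
    {"punct": ",", "has_verb": False},
    {"punct": ";", "has_verb": True},
    {"punct": ";", "has_verb": False},
]
labels = [True, False, True, False]

print(split_at_punctuation(["他", "来", "了", ",", "我", "走", "了", "。"]))
print(best_feature(rows, labels, ["punct", "has_verb"]))
```

On this toy data the hypothetical `has_verb` feature separates the labels perfectly, so it is the one ID3 would test first. A full ID3 tree would apply the same selection recursively to each partition.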

Key words

computer application / Chinese information processing / syntactic parser / separate parsing phrase / decision tree algorithm (ID3)

Cite this article

MAO Qi, LIAN Le-xin, ZHOU Wen-cui, YUAN Chun-feng. Chinese Syntactic Parsing Algorithm Based on Segmentation of Punctuation. Journal of Chinese Information Processing. 2007, 21(2): 29-34


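Funding

Supported by the National 863 High-Tech Program (2002AA117010-10) and by the Ministry of Education Science and Technology Infrastructure Platform Construction Project under the Tenth Five-Year Key Technologies Program.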