ZHOU Qiang
2004, 18(4): 2-9.
Syntactically annotated corpora, commonly called "treebanks", play an important role in empirical linguistics and in machine learning methods for natural language processing. After briefly surveying treebank annotation schemes for several languages, this paper proposes a new annotation scheme for a Chinese treebank. Under this scheme, every Chinese sentence is annotated with a complete parse tree in which each non-terminal constituent is assigned two tags. One is the syntactic constituent tag, which describes the constituent's external functional relation to other constituents in the parse tree. The other is the grammatical relation tag, which describes the internal structural relation among its sub-components. The two tag sets consist of 16 and 27 tags, respectively. Together they form an integrated annotation of each syntactic constituent in a parse tree through top-down and bottom-up descriptions. Based on this scheme, we built a 1,000,000-word Chinese treebank covering a balanced collection of journalistic, literary, academic, and other documents. Annotation experiments on different kinds of complex linguistic phenomena demonstrate the applicability and compatibility of the scheme.
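The double-tag idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Node` class, the helper functions, the bracketed output format, and the tag strings (`S`, `VP`, `subj-pred`, `verb-obj`) are hypothetical placeholders, not the paper's actual 16 constituent tags or 27 relation tags.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A parse-tree node: leaves carry a word, non-terminals carry two tags."""
    constituent_tag: Optional[str] = None  # external function (placeholder values)
    relation_tag: Optional[str] = None     # internal structure of the children
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return self.word is not None


def leaf(word: str) -> Node:
    """Build a terminal node holding a single word."""
    return Node(word=word)


def validate(node: Node) -> bool:
    """Return True if every non-terminal in the tree has both tags."""
    if node.is_leaf():
        return True
    if not node.constituent_tag or not node.relation_tag:
        return False
    return all(validate(child) for child in node.children)


def to_bracketed(node: Node) -> str:
    """Render the tree in an ad hoc bracketed form (not the paper's format)."""
    if node.is_leaf():
        return node.word
    inner = " ".join(to_bracketed(child) for child in node.children)
    return f"[{node.constituent_tag}-{node.relation_tag} {inner}]"


# Hypothetical three-word sentence with placeholder tags.
tree = Node("S", "subj-pred", children=[
    leaf("我"),
    Node("VP", "verb-obj", children=[leaf("读"), leaf("书")]),
])

print(validate(tree))
print(to_bracketed(tree))
```

Keeping the two tags as separate fields, rather than one fused label, mirrors the abstract's distinction between the external (top-down) and internal (bottom-up) descriptions of a constituent.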