Abstract
Treebanks are an important basic resource for natural language processing. Existing treebanks are essentially single-view: each supports either phrase structure grammar or dependency grammar, but not both. This paper proposes an annotation schema for a multi-view Chinese treebank based on dependency grammar. Under this schema, annotators mark only two kinds of information, the head and the syntactic role of each dependent node. The phrase structure function and hierarchy information needed to describe the syntactic structure can then be derived automatically, which yields more grammatical information without increasing the annotation workload. Based on this schema, we built the PKU Multi-view Chinese Treebank (PMT) version 1.0, which contains 64,000 sentences and 1.4 million words and supports both a phrase structure grammar view and a dependency grammar view.
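To make the idea of deriving a phrase-structure view from head and role annotations concrete, the following is a minimal sketch only, not the derivation procedure used for PMT, which the abstract does not specify. It assumes a projective dependency tree and projects each head together with its dependents into one bracketed phrase. The function name `dependency_to_brackets`, the role labels `SBJ`, `OBJ`, `ATT`, and the `HEAD:` marker are hypothetical and are not taken from the PMT tag set.

```python
from collections import defaultdict


def dependency_to_brackets(words, heads, roles):
    """Project a (projective) dependency tree into a bracketed phrase view.

    words: list of tokens
    heads: heads[i] is the 0-based index of the head of word i, -1 for the root
    roles: roles[i] is the syntactic role of word i relative to its head
           (hypothetical labels, not the PMT tag set)
    """
    children = defaultdict(list)
    root = None
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)
    if root is None:
        raise ValueError("the sentence has no root (no head index of -1)")

    def build(i):
        # A word without dependents is a leaf: no new phrase level is created.
        if not children[i]:
            return words[i]
        # Otherwise the head and its dependents form one phrase, listed in
        # surface order; each dependent subtree becomes a nested bracket.
        parts = []
        for j in sorted(children[i] + [i]):
            if j == i:
                parts.append(f"HEAD:{words[i]}")
            else:
                parts.append(f"{roles[j]}:{build(j)}")
        return "[" + " ".join(parts) + "]"

    return build(root)


# Toy sentence: 我 喜欢 红 苹果 ("I like red apples")
words = ["我", "喜欢", "红", "苹果"]
heads = [1, -1, 3, 1]                 # 喜欢 is the root; 红 modifies 苹果
roles = ["SBJ", "ROOT", "ATT", "OBJ"]
print(dependency_to_brackets(words, heads, roles))
```

In this sketch the hierarchy comes entirely from the head links and the phrase-internal functions come from the role labels, which is the general sense in which two annotated fields can support a second view. A real schema would also need rules for phrase category labels and for layering multiple dependents of one head, which this flat projection does not attempt.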
Keywords
Multi-view Chinese treebank /
phrase structure grammar /
dependency grammar
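Funding
National 863 Program thematic project (2012AA011101); Major Program of the National Social Science Fund of China (12&ZD227); National Natural Science Foundation of China, Young Scientists Fund (61103089); Shandong Province Outstanding Young and Middle-aged Scientists Research Award Fund (BS2013DX020); Ludong University Humanities and Social Sciences Research Project (WY2013003).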