Abstract
This paper presents a predicate-centered Chinese Chunk-Based Dependency Grammar (CCDG). The grammar takes the predicate as its core and the chunk as its unit of analysis: it identifies the chunks governed by each predicate both within and across sentences, uses the dependency relations between chunks to recover omitted elements, and makes predicate government explicit. The paper outlines the principles of CCDG and defines its chunks and their dependency relations. Following this scheme, 2,199 texts from two domains, encyclopedia and news, have been annotated, totaling about 1.8 million Chinese characters. The annotation procedure, annotation consistency, and data distribution are described in detail. Based on the current treebank, about 25% of Chinese clauses are found to be not self-sufficient, and about 88% of core predicates govern one to three subordinate components.
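As a rough illustration of the structure the abstract describes, the following minimal sketch models a core predicate together with the chunks it governs, and computes the share of predicates that govern one to three dependents. All class names, chunk labels, and relation names (`Chunk`, `PredicateNode`, `"subj"`, `"obj"`, `"NP"`, `"VP"`) are assumptions made for this example and are not taken from the CCDG annotation scheme; the three toy clauses are invented and are not drawn from the treebank.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A chunk of text. The label set is hypothetical, not CCDG's own."""
    text: str
    label: str  # e.g. "NP" or "VP" (illustrative only)


@dataclass
class PredicateNode:
    """A core predicate chunk together with the chunks it governs."""
    predicate: Chunk
    # Each dependent is (relation name, governed chunk); relation names are assumptions.
    dependents: list[tuple[str, Chunk]] = field(default_factory=list)


def share_with_one_to_three(nodes: list[PredicateNode]) -> float:
    """Return the fraction of predicates that govern between 1 and 3 dependents."""
    if not nodes:
        return 0.0
    hits = sum(1 for n in nodes if 1 <= len(n.dependents) <= 3)
    return hits / len(nodes)


# Toy data: the second clause omits its subject, which is recovered by
# linking the subject chunk of the first clause across the clause boundary.
xiao_wang = Chunk("小王", "NP")
nodes = [
    PredicateNode(Chunk("买了", "VP"), [("subj", xiao_wang), ("obj", Chunk("一本书", "NP"))]),
    PredicateNode(Chunk("回家", "VP"), [("subj", xiao_wang)]),  # subject filled in from clause 1
    PredicateNode(Chunk("下雨", "VP"), []),  # predicate with no governed chunk
]
ratio = share_with_one_to_three(nodes)
```

Sharing one `Chunk` object between two predicates is only one possible way to represent a cross-clause dependency; the actual treebank may encode it differently.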
Key words
chunk / Chinese chunk-based dependency grammar / treebank
Funding
State Language Commission project (ZDI135-114)