Abstract
Several Chinese dependency treebanks have been annotated to support research on Chinese syntactic parsing. However, existing treebanks mainly cover canonical text and give little consideration to web text such as blogs, Weibo, and WeChat. This paper presents a large-scale treebank annotation effort based on a recently designed annotation guideline and a visual online annotation system. Fifteen part-time annotators were recruited, and a strict annotation procedure is applied to guarantee quality. So far, we have annotated about 30,000 Chinese sentences with dependency syntax trees, including about 10,000 sentences from Taobao Headlines (淘宝头条) texts. The paper describes data selection and the annotation workflow in detail, and analyzes annotation accuracy, inter-annotator consistency, and the distribution of the annotated data. In future work, we will continue annotating multi-domain, multi-source texts to enlarge the treebank and release the annotated data in an appropriate way.
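The abstract does not define how inter-annotator consistency is measured. As a minimal sketch, assuming each sentence is annotated independently by two annotators and each tree is stored as one (head, label) pair per token, consistency could be computed at the head, labeled-arc, and whole-sentence levels. The function name, the data layout, and the dependency labels below are hypothetical and are not taken from the paper's annotation guideline.

```python
from typing import List, Tuple

# One arc per token: (head index, dependency label); head 0 denotes the root.
# This layout is an assumption for illustration, not the paper's storage format.
Arc = Tuple[int, str]
Tree = List[Arc]


def annotation_consistency(trees_a: List[Tree], trees_b: List[Tree]) -> Tuple[float, float, float]:
    """Compare two annotators' trees for the same sentences.

    Returns (head agreement, labeled-arc agreement, sentence agreement):
    the fraction of tokens with the same head, the fraction of tokens with
    the same head and label, and the fraction of sentences annotated
    identically.
    """
    if len(trees_a) != len(trees_b):
        raise ValueError("both annotators must cover the same sentences")

    tokens = same_head = same_arc = same_sentence = 0
    for tree_a, tree_b in zip(trees_a, trees_b):
        if len(tree_a) != len(tree_b):
            raise ValueError("trees for one sentence must have the same length")
        heads = sum(a[0] == b[0] for a, b in zip(tree_a, tree_b))
        arcs = sum(a == b for a, b in zip(tree_a, tree_b))
        tokens += len(tree_a)
        same_head += heads
        same_arc += arcs
        same_sentence += int(arcs == len(tree_a))

    if tokens == 0:
        raise ValueError("no tokens to compare")
    return same_head / tokens, same_arc / tokens, same_sentence / len(trees_a)


# Two sentences; the annotators disagree on one label in the first sentence.
annotator_a = [[(2, "subj"), (0, "root"), (2, "obj")], [(0, "root"), (1, "obj")]]
annotator_b = [[(2, "subj"), (0, "root"), (2, "adv")], [(0, "root"), (1, "obj")]]
head_agr, arc_agr, sent_agr = annotation_consistency(annotator_a, annotator_b)
```

Annotation accuracy can be estimated with the same function by comparing one annotator's trees against the final adjudicated trees instead of against a second annotator.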
Key words
dependency syntax /
treebank construction /
multi-domain and multi-source texts
Funding
National Natural Science Foundation of China (61876116, 61673289); Major Program of Natural Science Research of Jiangsu Higher Education Institutions (16KJA520001)