为提升依存分析并分析影响其精度的相关因素,该文构建了大规模中文通用依存树库和中等规模领域依存树库。基于这一系列树库,通过句法分析实验考察质量、规模、领域差异等因素对中文依存分析的影响,实验结果表明: (1)树库规模和质量均与句法分析精度成正相关关系,质量应先于规模因素被优先考虑;(2)通用树库和领域树库之间的差异程度与前者对后者的替代性成相关关系;(3)两种树库混合使用的效果同样与领域差异有关。
Abstract
To boost Chinese dependency parsing and analyze factors influencing Chinese dependency parsing, we constructe a large-scale general treebank and several middle-scale treebanks for specific domains. Then, we performe experiments to evaluate the parsing accuracy influenced by the quality, the scale and the domain difference of the dependency treenbank. The results show that both the treebank quality and its scale are positively related to parsing accuracy, and the quality is more influential. The experiments also demonstrate that general treebanks and domain treebanks are complementary, and, whether a general treebank and domain treebank should be used together is dependent on the difference between them.
关键词
依存树库 /
领域迁移 /
依存句法分析
{{custom_keyword}} /
Key words
dependency treebank /
domain adaptation /
dependency parsing
{{custom_keyword}} /
{{custom_sec.title}}
{{custom_sec.title}}
{{custom_sec.content}}
参考文献
[1] Ryan McDonald, Fernando Pereira, Kiril Ribarov, et al. Non-projective dependency parsing using spanning tree algorithms[C]//Proceedings of HLT-EMNLP, 2005: 523-530.
[2] Joakim Nivre. Inductive dependency parsing[M]. Springer.2006.
[3] Slav Petrov, Ryan McDonald. Overview of the 2012 Shared Task on Parsing the Web[C]//Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language, 2012.
[4] 李正华,车万翔,刘挺.短语结构树库向依存树库转化研究[J].中文信息学报, 2008,22(6): 14-19.
[5] Zhenhua Li, Ting Liu, Wanxiang Che. Exploiting multiple treebanks for parsing with quasisynchronous grammars[C]//Proceedings of ACL, 2012: 675-684.
[6] Kenji Sagae, Yusuke Miyao, Rune Stre, et al. Evaluating the Effects of Treebank Size in a Practical Application for Parsing[C]//Proceedings of ACL 2008 Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing, 2008: 14-20.
[7] Meishan Zhang, Yue Zhang, Wanxiang Che, et al. Type-Supervised Domain Adaptation for Joint Segmentation and POS-Tagging[C]//Proceedings of EACL, 2014: 588-597.
[8] Wanxiang Che, Zhenghua Li, Ting Liu. Chinese Dependency Treebank 1.0 LDC2012T05[DB]. Web Download. Philadelphia: Linguistic Data Consortium, 2012.
[9] Likun Qiu, Yue Zhang, Peng Jin, et al. Multi-view Chinese treebanking[C]//Proceedings of COLING, 2014: 257-268.
[10] Pi-Chuan Chang, Huihsin Tseng, Dan Jurafsky, et al. Discriminative reordering with Chinese grammatical relations features[C]//Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation, 2009: 51-59.
[11] 刘海涛. 基于依存树库的汉语句法计量研究[J]. 长江学术, 2008, 3:120-128.
[12] Wenliang Chen, Jun'ichi Kazama, Kiyotaka Uchimoto, et al. Improving Dependency Parsing with Subtrees from Auto-Parsed Data[C]//Proceedings of EMNLP, 2009, 2: 570-579.
[13] Bernd Bohnet. Top accuracy and fast dependency parsing is not a contradiction[C]//Proceedings of Coling, 2010: 89-97.
[14] Yue Zhang, Stephen Clark. Syntactic Processing Using the Generalized Perceptron and Beam Search[J]. Computational Linguistics, 2011, 37(1): 105-151.
[15] Wanxiang Che, Valentin Spitkovsky, Ting Liu. A comparison of Chinese parsers for Stanford dependencies[C]//Proceedings of EACL, 2012: 11-16.
[16] Nianwen Xue, Fei Xia, Fu-Dong Chiou, et al. The Penn Chinese Treebank: Phrase Structure Annotation of a Large Corpus[J]. Natural Language Engineering, 2005, 11(2): 207-238.
[17] 陈凤仪,蔡碧芳,陈克健,等. 中文句结构树资料库 (Sinica Treebank)的构建[J]. Computational Linguistics and Chinese Language Processing, 1999, 4(2): 87-104.
[18] 周强.2004.汉语句法树库标注体系[J].中文信息学报, 2004, 18(4): 1-8.
[19] 靳光瑾,肖航,富丽,等.现代汉语语料库建设及深加工[J].语言文字应用, 2005, 2: 111-120.
[20] 詹卫东.树库在汉语语法辅助教学中的应用初探[J]. Journal of Technology and Chinese Language Teaching, 2012, 3(2): 16-29.
[21] Nianwen Xue, Xiuhong Zhang, Zixin Jiang, et al. Chinese Treebank 8.0 LDC2013T21[DB]. Web Download. Philadelphia: Linguistic Data Consortium. 2013.
{{custom_fnGroup.title_cn}}
脚注
{{custom_fn.content}}
基金
国家社科基金重大项目(12&ZD227);国家自然科学基金(61572245,61370117,61103089);教育部新世纪优秀人才支持计划(NECT-11-0839);山东省优秀中青年科学家科研奖励基金(BS2013DX020);鲁东大学人文社会科学研究项目(WY2013003)
{{custom_fund}}