(1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; 2. Tsinghua National Laboratory for Information Science and Technology (TNList), Center for Speech and Language Technologies, Research Institute of Information Technology, Tsinghua University, Beijing 100084, China)
Abstract: To address the problems of chunk boundary identification and intra-chunk structure analysis, this paper explores a new chunk parsing task based on the Chinese concept compound chunk (CCC) scheme. A detailed comparison with the earlier base chunk and functional chunk schemes reveals the main difficulties in CCC parsing. On this basis, the paper proposes a CCC parsing method based on the “shift-reduce” model. Experiments on a CCC bank automatically extracted from the Tsinghua Chinese Treebank (TCT) show that the method is feasible for parsing some simple CCCs, which facilitates further syntactic and semantic parsing of complex CCCs.
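As a rough illustration of the shift-reduce idea named in the abstract, the sketch below groups a tagged token sequence into flat chunks. It is not the paper's implementation: the function names `shift_reduce_chunk` and `toy_rule`, the tag set, and the hand-written decision rule are all assumptions made for this example, and a trained classifier would normally take the place of the rule. The sketch covers boundary identification only, not intra-chunk structure.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); the tags are hypothetical


def shift_reduce_chunk(
    tokens: List[Token],
    choose_action: Callable[[List[Token], Token], str],
) -> List[List[Token]]:
    """Group tokens into flat chunks with SHIFT and REDUCE actions.

    SHIFT moves the next input token onto the stack.
    REDUCE closes the stack contents as one finished chunk.
    """
    stack: List[Token] = []
    chunks: List[List[Token]] = []
    pos = 0
    while pos < len(tokens) or stack:
        nxt: Optional[Token] = tokens[pos] if pos < len(tokens) else None
        if not stack:
            action = "shift"   # nothing to reduce yet
        elif nxt is None:
            action = "reduce"  # input exhausted, close the last chunk
        else:
            action = choose_action(stack, nxt)
        if action == "shift":
            stack.append(nxt)
            pos += 1
        else:
            chunks.append(list(stack))
            stack.clear()
    return chunks


def toy_rule(stack: List[Token], nxt: Token) -> str:
    """Hypothetical stand-in for a learned model: keep extending a
    nominal chunk while the next token is a noun."""
    top_tag = stack[-1][1]
    return "shift" if top_tag in {"a", "n"} and nxt[1] == "n" else "reduce"


sentence = [("new", "a"), ("chunk", "n"), ("scheme", "n"),
            ("is", "v"), ("proposed", "v")]
chunks = shift_reduce_chunk(sentence, toy_rule)
print([[word for word, _ in chunk] for chunk in chunks])
```

Keeping the decision function as a parameter separates the transition system from the model that scores actions, so the rule can be swapped for a statistical classifier without touching the parsing loop.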