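The main task of chunking is to identify and delimit chunks, which simplifies syntactic parsing to some extent. To address the difficulties that chunking faces on long sentences, this paper proposes a chunking method based on a divide-and-conquer strategy. The basic idea is to first recognize the maximal noun phrases (MNPs) in a sentence and, based on the recognition results, decompose the sentence into MNP parts and a sentence-frame part; different models are then applied to the different analysis units, and the partial results are combined to complete the chunking of the whole sentence. By decomposing the full sentence into smaller chunking units, the method reduces sentence complexity. Experiments on the Penn Chinese Treebank CTB4 data set show an average F1 score of 91.79% across chunk types, which is better than other current chunking methods.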
Abstract
Chunking comprises the identification and labeling of chunks; by segmenting a sentence into small chunks, it reduces the difficulty of complete syntactic parsing. To reduce the complexity of chunking long sentences, this paper describes a divide-and-conquer strategy. The basic idea is to first recognize the maximal noun phrases (MNPs) in a full sentence, and then identify the chunks within the MNPs and within the frame of the sentence that remains once the MNPs are set aside. Experiments are carried out on the UPenn Chinese Treebank-4 (CTB4) data set, and the results show that the best overall F1 score for Chinese chunking is 91.79%, which is higher than the performance of state-of-the-art machine learning models.
Key words: Chinese chunking; divide-and-conquer; complete syntactic parsing; maximal noun phrase; conditional random fields; support vector machines
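The divide-and-conquer idea summarized in the abstract can be illustrated with a minimal sketch. It assumes that MNP recognition has already produced non-overlapping token spans, and that the two chunkers are arbitrary callables returning one label per token. The names `split_sentence`, `chunk_sentence`, and `MNP_SLOT`, and the placeholder-token scheme, are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]                        # (word, POS tag)
Span = Tuple[int, int]                         # [start, end) token offsets of one MNP
Chunker = Callable[[List[Token]], List[str]]   # one chunk label per input token

# Hypothetical placeholder that stands in for a whole MNP inside the frame.
MNP_SLOT: Token = ("<MNP>", "MNP")


def split_sentence(tokens: List[Token], mnp_spans: List[Span]):
    """Divide: separate the MNP token lists from the sentence frame."""
    mnps: List[List[Token]] = []
    frame: List[Token] = []
    cursor = 0
    for start, end in sorted(mnp_spans):       # spans assumed non-overlapping
        frame.extend(tokens[cursor:start])     # frame tokens before this MNP
        frame.append(MNP_SLOT)                 # mark where the MNP was
        mnps.append(tokens[start:end])
        cursor = end
    frame.extend(tokens[cursor:])              # trailing frame tokens
    return mnps, frame


def chunk_sentence(tokens: List[Token], mnp_spans: List[Span],
                   mnp_chunker: Chunker, frame_chunker: Chunker) -> List[str]:
    """Conquer each unit with its own model, then combine the labels."""
    mnps, frame = split_sentence(tokens, mnp_spans)
    mnp_labels = [mnp_chunker(m) for m in mnps]
    frame_labels = frame_chunker(frame)

    merged: List[str] = []
    next_mnp = 0
    for tok, label in zip(frame, frame_labels):
        if tok is MNP_SLOT:                    # expand the placeholder
            merged.extend(mnp_labels[next_mnp])
            next_mnp += 1
        else:
            merged.append(label)
    return merged


# Toy stand-ins for trained models (assumptions, not the paper's models).
def toy_mnp_chunker(m: List[Token]) -> List[str]:
    return ["B-NP"] + ["I-NP"] * (len(m) - 1)


def toy_frame_chunker(f: List[Token]) -> List[str]:
    return ["O"] * len(f)


sentence = [("he", "PN"), ("bought", "VV"), ("three", "CD"),
            ("red", "JJ"), ("apples", "NN"), ("yesterday", "NT")]
labels = chunk_sentence(sentence, [(2, 5)], toy_mnp_chunker, toy_frame_chunker)
```

In a real system the two callables would be trained sequence labelers; the sketch only shows how a sentence might be split into units and how per-unit labels could be stitched back into one label sequence of the original length.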
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60842005)