Abstract
This paper proposes a Chinese word segmentation (CWS) method that combines unsupervised and supervised learning by incorporating Accessor Variety (AV) into a segmentation system based on Conditional Random Fields (CRFs). To address a weakness of AV when training data is limited, a normalization step is introduced to reduce the fluctuation of AV values computed during the unsupervised segmentation phase. Experiments on the Bakeoff-4 data show that normalized Accessor Variety improves performance in both the closed and the open tracks.
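As a rough illustration of the quantity involved, the sketch below computes plain Accessor Variety in the commonly used sense: for a substring, the smaller of the number of distinct characters seen immediately before it and immediately after it in a corpus. The abstract does not give the normalization formula, so none is shown. The function name `accessor_variety`, the `max_len` cutoff, and the treatment of sentence boundaries as single `<S>` and `<E>` symbols are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Return {substring: AV} for all substrings up to max_len characters.

    AV(s) = min(number of distinct left neighbours of s,
                number of distinct right neighbours of s).
    Sentence start and end are each counted as one special symbol
    (a simplifying assumption for this sketch).
    """
    left = defaultdict(set)   # substring -> characters seen just before it
    right = defaultdict(set)  # substring -> characters seen just after it
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                left[s].add(sent[i - 1] if i > 0 else "<S>")
                right[s].add(sent[j] if j < n else "<E>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


# Toy corpus: "b" is preceded by {a, x} and followed by {c, d}.
av = accessor_variety(["abc", "xbc", "abd"])
print(av["b"], av["bc"])
```

With a small corpus, these counts can change sharply when a few sentences are added or removed, which is the kind of fluctuation the abstract says normalization is meant to reduce.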
Key words: computer application; Chinese information processing; unsupervised segmentation; CRFs; normalized accessor variety
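Funding
Supported by the Programme of Introducing Talents of Discipline to Universities (B08004) and the National Key Technology R&D Program (2007BAH05B02-04).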