Apply Normalized Accessor Variety in Chinese Word Segmentation

HE Saike1, WANG Xiaojie2, DONG Yuan1,3, ZHANG Taozheng2, BAI Xue2

Journal of Chinese Information Processing, 2010, Vol. 24, Issue 1: 15-20.
Review

Abstract

This paper proposes a Chinese word segmentation (CWS) method that combines unsupervised and supervised learning by incorporating Accessor Variety (AV) into a segmentation system based on Conditional Random Fields (CRFs). To address a weakness of AV when training data is limited, a normalized variant is introduced that reduces the fluctuation arising when AV values are computed in the unsupervised segmentation phase. Experiments on the Bakeoff-4 data show that normalized Accessor Variety improves performance on both the closed and the open tracks.
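
As a rough illustration of the unsupervised quantity named in the abstract, the sketch below computes accessor variety for each candidate string as the smaller of its number of distinct left-neighbour and right-neighbour characters in a raw corpus. This is the commonly cited definition and not necessarily the exact variant used in the paper. The function name `accessor_variety`, the `max_len` parameter, the toy corpus, and the choice to treat sentence start and end as single pseudo-neighbours (`<S>`, `<E>`) are all assumptions made for the example. The paper's normalization step is not described in the abstract and is deliberately left out.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=2):
    """Map each substring (up to max_len chars) to
    min(#distinct left neighbours, #distinct right neighbours)."""
    left = defaultdict(set)   # substring -> set of characters seen just before it
    right = defaultdict(set)  # substring -> set of characters seen just after it
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                # Assumption: sentence boundaries count as one shared pseudo-neighbour.
                left[s].add(sent[i - 1] if i > 0 else "<S>")
                right[s].add(sent[j] if j < n else "<E>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


# Hypothetical toy corpus, for illustration only.
corpus = ["我爱北京", "他爱北京", "北京很大", "我去北京了"]
av = accessor_variety(corpus)
print(av["北京"], av["爱北"])  # prints: 3 1
```

On this toy corpus the true word "北京" appears in three different left and right contexts, while the non-word "爱北" is always followed by the same character, so its AV is lower. In a combined system, scores of this kind would typically be discretized and passed to the CRF as extra features; how the paper does so is not stated here.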

Key words

computer application / Chinese information processing / unsupervised segmentation / CRFs / normalized accessor variety

Cite this article

HE Saike, WANG Xiaojie, DONG Yuan, ZHANG Taozheng, BAI Xue. Apply Normalized Accessor Variety in Chinese Word Segmentation. Journal of Chinese Information Processing. 2010, 24(1): 15-20

Funding

Programme of Introducing Talents of Discipline to Universities (B08004); National Key Technology R&D Program (2007BAHo5B02-04)