Chinese Word Segmentation: A Decade Review

HUANG Chang-ning, ZHAO Hai

Journal of Chinese Information Processing, 2007, Vol. 21, Issue 3: 8-19.
Review


Abstract

Over the past decade, and especially since the First International Chinese Word Segmentation Bakeoff was held in July 2003, automatic Chinese word segmentation has made encouraging progress. The main advances are: (1) the combination of "segmentation guidelines + lexicon + segmented corpus" has given Chinese words in real text a computable definition, which is the basis for automatic segmentation and for comparable evaluation; (2) practice has shown that segmentation systems based on handcrafted rules are outperformed in evaluation by systems based on statistical learning; (3) evaluation on Bakeoff data shows that the accuracy loss caused by out-of-vocabulary (OOV) words is at least five times greater than that caused by segmentation ambiguities; (4) experiments show that better OOV recognition yields higher overall segmentation accuracy, and that the character-based tagging approach to statistical learning, which substantially improves OOV recognition, outperforms earlier word-based (or dictionary-based) methods and has brought the accuracy of automatic segmentation systems to a new high.
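The character-based tagging approach named in the abstract recasts segmentation as labelling each character with its position inside a word. A minimal sketch of that reformulation follows. It assumes a four-tag B/M/E/S scheme (begin, middle, end, single), which the abstract itself does not specify, and the helper names `words_to_tags` and `tags_to_words` are hypothetical. A real system would predict the tags with a trained statistical model rather than derive them from gold-standard words as done here.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Turn a segmented sentence into one position tag per character.

    Assumed tag set: B (word begin), M (word middle), E (word end),
    S (single-character word).
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their position tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


segmented = ["中文", "分词", "十", "年", "回顾"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
print(tags)
print(restored)
```

Because the tagger decides character by character, a word that never appeared in the lexicon can still be recovered whenever its characters receive a consistent tag sequence, which is one way to read the abstract's link between this approach and OOV recognition.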

Key words

computer application / Chinese information processing / Chinese word segmentation (CWS) / definition of words / out-of-vocabulary (OOV) word recognition / character-based tagging approach to CWS

Cite this article

HUANG Chang-ning, ZHAO Hai. Chinese Word Segmentation: A Decade Review. Journal of Chinese Information Processing. 2007, 21(3): 8-19


Funding

Supported by the National Natural Science Foundation of China (60621062) and the National 973 Program of China (2003CB317007, 2004CB318108)