Chinese Word Segmentation Based on Word Boundary Decision

LI Shoushan, HUANG Chu-Ren

Journal of Chinese Information Processing (中文信息学报), 2010, Vol. 24, Issue 1: 3-8.
Review


Abstract

This paper studies the word boundary decision (WBD) approach to Chinese word segmentation. The approach directly classifies each boundary between two adjacent characters as either a word boundary or not, and the segmentation follows from these decisions. Compared with the state-of-the-art methods based on character tagging, WBD is simpler to implement and faster to train and run, while achieving competitive segmentation performance. More notably, an online learning module can easily be derived from the WBD approach, so that a WBD system learns quickly from newly annotated samples and adapts to new data, enabling a reliable online Chinese segmentation system without domain or training data constraints.
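
The idea in the abstract can be illustrated with a minimal sketch: each gap between two adjacent characters becomes one binary classification instance, and a linear model is updated one sample at a time. Everything specific here is an assumption made for illustration rather than the authors' implementation: the feature templates, the passive-aggressive (PA-I) style update, the names `boundary_features`, `gold_boundaries` and `OnlineBoundaryClassifier`, and the toy sentences.

```python
def boundary_features(chars, i):
    """Binary features for the gap between chars[i] and chars[i + 1].

    The templates (left character, right character, straddling bigram,
    bias) are illustrative assumptions, not the paper's feature set.
    """
    left, right = chars[i], chars[i + 1]
    return ["L=" + left, "R=" + right, "LR=" + left + right, "bias"]


def gold_boundaries(words):
    """Convert a segmented sentence into (characters, 0/1 gap labels)."""
    chars = "".join(words)
    labels = []
    for word in words:
        labels.extend([0] * (len(word) - 1))  # gaps inside a word
        labels.append(1)                      # gap after the word
    return chars, labels[:-1]  # no gap follows the last character


class OnlineBoundaryClassifier:
    """Linear boundary classifier with a PA-I style online update."""

    def __init__(self, aggressiveness=1.0):
        self.weights = {}
        self.aggressiveness = aggressiveness  # cap on the step size

    def score(self, features):
        return sum(self.weights.get(f, 0.0) for f in features)

    def update(self, features, is_boundary):
        y = 1.0 if is_boundary else -1.0
        loss = max(0.0, 1.0 - y * self.score(features))  # hinge loss
        if loss > 0.0:
            # Features are distinct and binary, so the squared norm of
            # the feature vector equals the number of features.
            tau = min(self.aggressiveness, loss / len(features))
            for f in features:
                self.weights[f] = self.weights.get(f, 0.0) + tau * y

    def learn(self, words):
        """Update the model from one segmented sentence (online step)."""
        chars, labels = gold_boundaries(words)
        for i, label in enumerate(labels):
            self.update(boundary_features(chars, i), label)

    def segment(self, text):
        """Cut the text wherever a gap is classified as a word boundary."""
        if not text:
            return []
        words, start = [], 0
        for i in range(len(text) - 1):
            if self.score(boundary_features(text, i)) > 0.0:
                words.append(text[start:i + 1])
                start = i + 1
        words.append(text[start:])
        return words


if __name__ == "__main__":
    # Toy sentences, invented for this sketch only.
    corpus = [["我", "喜欢", "北京"], ["北京", "很", "大"]]
    model = OnlineBoundaryClassifier()
    for _ in range(5):
        for sentence in corpus:
            model.learn(sentence)
    print(model.segment("我喜欢北京"))
```

Because every gap is classified independently, no sequence decoding is needed, and a new annotated sentence can be absorbed by a single call to `learn` without retraining from scratch. This is the property the abstract attributes to the online variant.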

Key words

computer application / Chinese information processing / Chinese word segmentation / WBD approach / online learning

Cite this article

LI Shoushan, HUANG Chu-Ren. Chinese Word Segmentation Based on Word Boundary Decision. Journal of Chinese Information Processing. 2010, 24(1): 3-8

Funding

Start-up funding project for new professors, The Hong Kong Polytechnic University (1-BBZM)