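This paper studies a new word segmentation method: the word boundary classification approach. The method directly classifies the boundary between each pair of adjacent characters, deciding whether it is a boundary between two words, and thereby segments the text. Compared with the currently dominant character-tagging approach, it is faster, simpler, and more direct to implement and train, yet achieves comparable segmentation accuracy. More notably, an online learning method for segmentation follows readily from word boundary classification, allowing our segmentation system to learn from newly annotated samples very quickly.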
Abstract
This paper focuses on the word boundary decision (WBD) approach to Chinese word segmentation. This new approach classifies each boundary between two adjacent characters as either a word boundary or not. Compared to state-of-the-art methods based on character tagging, this approach is easier to implement and faster to execute, while achieving competitive performance. In particular, a robust online learning module can be added to adapt a WBD system to new data quickly, enabling a reliable online Chinese segmentation system without domain or training data constraints.
Key words: computer application; Chinese information processing; Chinese word segmentation; WBD approach; online learning
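The idea summarized in the abstract, deciding for each gap between two adjacent characters whether it is a word boundary and updating the model one annotated sentence at a time, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the feature set in `boundary_features`, the class name `BoundaryClassifier`, and the mistake-driven perceptron update are not taken from the paper, whose actual features and learner are not described in this abstract.

```python
def boundary_features(chars, i):
    """Features for the gap between chars[i-1] and chars[i].

    Hypothetical feature set: the two neighbouring characters and the
    character bigrams around the gap.
    """
    left, right = chars[i - 1], chars[i]
    left2 = chars[i - 2] if i >= 2 else "<s>"
    right2 = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [
        "L=" + left,
        "R=" + right,
        "LR=" + left + right,
        "LL=" + left2 + left,
        "RR=" + right + right2,
    ]


class BoundaryClassifier:
    """Binary linear classifier over character gaps (illustrative only)."""

    def __init__(self):
        self.weights = {}

    def score(self, feats):
        return sum(self.weights.get(f, 0.0) for f in feats)

    def update(self, feats, label):
        """Perceptron step: label is +1 (boundary) or -1 (no boundary)."""
        if label * self.score(feats) <= 0.0:
            for f in feats:
                self.weights[f] = self.weights.get(f, 0.0) + label

    def learn_sentence(self, words):
        """Online learning from one segmented sentence (a list of words)."""
        chars = "".join(words)
        cuts, pos = set(), 0
        for w in words[:-1]:
            pos += len(w)
            cuts.add(pos)  # gold boundary sits before chars[pos]
        for i in range(1, len(chars)):
            self.update(boundary_features(chars, i), 1 if i in cuts else -1)

    def segment(self, text):
        """Cut the text at every gap whose score is positive."""
        if not text:
            return []
        words, start = [], 0
        for i in range(1, len(text)):
            if self.score(boundary_features(text, i)) > 0.0:
                words.append(text[start:i])
                start = i
        words.append(text[start:])
        return words


if __name__ == "__main__":
    clf = BoundaryClassifier()
    clf.learn_sentence(["我们", "喜欢", "分词"])  # one new annotated sample
    print(clf.segment("我们喜欢分词"))
```

Because each gap is classified independently, no sequence decoding is needed, and a newly annotated sentence can be absorbed with a single call to `learn_sentence` rather than a full retraining pass.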
Funding
Start-up grant for new professors, The Hong Kong Polytechnic University (1-BBZM)