Effective Subsequence-Based Tagging for Chinese Word Segmentation

ZHAO Hai, Chunyu Kit

Journal of Chinese Information Processing, 2007, Vol. 21, Issue 5: 8-13.
Review

Abstract

Driven by the rise of learning methods and frameworks built on pre-segmented corpora, Chinese word segmentation made notable progress in the first years of this century. Since the First International Chinese Word Segmentation Bakeoff in 2003 in particular, statistical learning based on character tagging has attracted wide attention. This paper generalizes that framework from character-based to subsequence-based tagging: the goal is to find longer tagging units with a more reliable algorithm, so that segmentation can be learned from large-scale corpora, and to remedy shortcomings of earlier work. We propose a general two-step framework for subsequence tagging. In the first step, an iterative maximum matching filtering algorithm determines an effective subsequence lexicon; in the second step, a bi-lexicon maximum matching algorithm identifies subsequence units in a given text. The effectiveness of the approach is verified by experiments on two closed test data sets from Bakeoff-2005.
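As a rough illustration of the matching step that both stages of the framework build on, the sketch below implements plain forward maximum matching over a single lexicon. It is not the paper's algorithm: the abstract does not spell out the iterative filtering procedure or the bi-lexicon variant, so the function name `forward_max_match`, the `max_len` parameter, and the fallback to single characters for unmatched text are all assumptions made for this example.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy left-to-right longest-match split of `text` into units.

    Hypothetical helper for illustration only: any character that starts
    no lexicon entry is emitted as a single-character unit.
    """
    if max_len is None:
        # Longest lexicon entry bounds the window; 1 if the lexicon is empty.
        max_len = max((len(w) for w in lexicon), default=1)
    max_len = max(1, max_len)  # guard against a zero or negative window

    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                units.append(candidate)
                i += size
                break
    return units


# Toy lexicon, made up for this example.
lexicon = {"中文", "分词", "中文分词"}
print(forward_max_match("中文分词方法", lexicon))
# → ['中文分词', '方', '法']
```

In a subsequence-tagging setup, a sequence labeller would then assign tags to these units rather than to individual characters, which is where longer and more reliable units are expected to help.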

Key words

computer application / Chinese information processing / Chinese word segmentation (CWS) / subsequence-based tagging approach to CWS

Cite this article

ZHAO Hai, Chunyu Kit. Effective Subsequence-Based Tagging for Chinese Word Segmentation. Journal of Chinese Information Processing. 2007, 21(5): 8-13

Funding

City University of Hong Kong SRG project 7002037 and CERG research project 9040861 (CityU 1318/03H), funded by the Hong Kong Special Administrative Region