Abstract: Research on automatic Chinese word segmentation has advanced rapidly in recent years, especially since the First International Chinese Word Segmentation Bakeoff, held in 2003. In particular, character-based tagging has achieved great success in this field. In this paper, we generalize this method to subsequence-based tagging, with the goal of finding longer tagging units through a reliable algorithm. We propose a two-step framework for this purpose. In the first step, an iterative maximum matching filtering algorithm is applied to obtain an effective subsequence lexicon; in the second step, a bi-lexicon-based maximum matching algorithm is employed to identify subsequence units. Experiments on two closed-test data sets from Bakeoff-2005 verify the effectiveness of this approach.
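For reference, the maximum matching operation named in the abstract can be illustrated with a minimal sketch of plain forward maximum matching against a single lexicon. This is not the iterative filtering algorithm or the bi-lexicon algorithm proposed in the paper; the function name `max_match`, the `max_len` parameter, and the toy lexicon are assumptions made purely for illustration.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative sketch).

    At each position, take the longest lexicon entry that matches;
    if none matches, fall back to a single character.
    """
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                units.append(candidate)
                i += length
                break
    return units


# Hypothetical toy lexicon of subsequences (not taken from the paper).
toy_lexicon = {"中国", "人民", "共和国"}
units = max_match("中华人民共和国", toy_lexicon)
print(units)
```

Characters not covered by any lexicon entry remain single-character units, so under this sketch a character-based tagger and a subsequence-based tagger would see the same input wherever the lexicon has no match.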
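About the authors: Hai Zhao (赵海, b. 1976), male, Ph.D., postdoctoral research fellow; his main research interests are natural language processing and machine learning. Chunyu Kit (揭春雨, b. 1964), male, Ph.D., assistant professor, supervisor of doctoral and master's students; his main research interests are computational linguistics, machine learning, computational terminology, and computational poetics. ① SIGHAN is the short name of the Special Interest Group on Chinese Language Processing of the Association for Computational Linguistics (ACL); website: http://www.sighan.org.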