Abstract: Over the last decade, and especially since the First International Chinese Word Segmentation Bakeoff was held in July 2003, research on automatic Chinese word segmentation has advanced considerably. The advances can be summarized as follows: (1) in the computational sense, Chinese words in real text have been well defined by “segmentation guidelines + lexicon + segmented corpus”; (2) practical results show that statistical segmentation systems outperform handcrafted rule-based systems; (3) evaluation on Bakeoff data shows that the accuracy drop caused by out-of-vocabulary (OOV) words is at least five times greater than that caused by segmentation ambiguities; (4) the better the OOV recognition, the higher the overall accuracy of the segmentation system, and statistical segmentation systems using the character-based tagging approach outperform any word-based system.
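The character-based tagging approach named in point (4) recasts segmentation as labeling each character with its position inside a word. The following is a minimal sketch of that encoding only, not of any particular system's model. It assumes a four-tag B/M/E/S scheme (begin, middle, end, single), which is one common choice; the abstract does not specify a tag set, and the helper names `words_to_tags` and `tags_to_words` are hypothetical.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one position tag per character.

    Assumed tag set: B (word-initial), M (word-internal),
    E (word-final), S (single-character word).
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their position tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        # A word closes on E (end of multi-character word) or S (single).
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


segmented = ["计算机", "很", "重要"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

Because the tagger decides character by character, a word absent from the lexicon can still be assembled from its tags, which is one plausible reading of why this approach helps with OOV words.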