李玉梅,陈晓,姜自霞,易江燕,靳光瑾,黄昌宁. 分词规范亟需补充的三方面内容[J]. 中文信息学报, 2007, 21(5): 1-7.
LI Yu-mei, CHEN Xiao, JIANG Zi-xia, YI Jiang-yan, JIN Guang-jin, HUANG Chang-ning. Three Complements to Make Better Guideline of Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 2007, 21(5): 1-7.
分词规范亟需补充的三方面内容
李玉梅1,陈晓1,姜自霞1,易江燕1,靳光瑾1,黄昌宁2
1. 教育部语言文字应用研究所,北京 100010; 2. 微软亚洲研究院,北京 100080
Three Complements to Make Better Guideline of Chinese Word Segmentation
LI Yu-mei1, CHEN Xiao1, JIANG Zi-xia1, YI Jiang-yan1, JIN Guang-jin1, HUANG Chang-ning2
1. Institute of Applied Linguistics, Ministry of Education, P.R.C., Beijing 100010, China; 2. Microsoft Research Asia, Beijing 100080, China
Abstract: This paper proposes three complements to Chinese word segmentation guidelines that are essential for building high-quality segmented Chinese corpora: tagging rules for named entities (person, location and organization names), tagging rules for factoids (dates, times, percentages, etc.), and disambiguation rules. These additions are needed because many corpora treat named entities and factoids as segmentation units, while earlier segmentation guidelines seldom address disambiguation. Annotators often have different intuitions about ambiguous strings, so guidelines need to explain how such strings should be handled. Our practice has shown that specifying explicit segmentation rules helps reduce errors and inconsistencies in annotated corpora.