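This paper argues that, to improve the quality of word segmentation annotation in corpora, segmentation guidelines should be supplemented with three components: (1) detailed tagging rules for named entities (person, location, and organization names); (2) detailed tagging rules for factoids (dates, times, percentages, etc.); and (3) detailed rules for resolving ambiguous strings. The reasons are twofold: many segmented corpora already treat named entities and factoids as segmentation units, and earlier segmentation guidelines almost never address ambiguity resolution. In fact, people's intuitions about ambiguous strings often differ, so guidelines need to explain typical ambiguous strings explicitly. Practice shows that spelling out these three components in the guidelines avoids, to a large extent, annotation errors and inconsistencies.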
Abstract
This paper proposes three additions to Chinese word segmentation guidelines, each essential for building high-quality segmented Chinese corpora: tagging rules for named entities (person, location, and organization names), tagging rules for factoids (dates, times, percentages, etc.), and disambiguation rules. These additions are needed because many corpora already treat named entities and factoids as segmentation units, while earlier segmentation guidelines seldom address disambiguation. People often have different intuitions about ambiguous strings, so typical cases need to be explained in the guidelines. Our practice shows that specifying these rules helps reduce errors and inconsistencies in annotated corpora.
关键词
计算机应用 /
中文信息处理 /
语料库 /
分词规范 /
分词歧义消解
Key words
computer application /
Chinese information processing /
corpus /
Chinese word segmentation guidelines /
word segmentation disambiguation
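Funding
Supported by the National Key Basic Research Program of China (973 Program) and by a Major Research Project of the State Language Commission under the Tenth Five-Year Plan (ZDA10544).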