YU Shi-wen, ZHU Xue-feng, DUAN Hui-ming. The Guideline for Segmentation and Part-Of-Speech Tagging on Very Large Scale Corpus of Contemporary Chinese[J]. Journal of Chinese Information Processing, 2000, 14(6): 58-64.
The Guideline for Segmentation and Part-Of-Speech Tagging on Very Large Scale Corpus of Contemporary Chinese
YU Shi-wen, ZHU Xue-feng, DUAN Hui-ming
Institute of Computational Linguistics, Peking University
Abstract: The Institute of Computational Linguistics of Peking University is developing a very large-scale corpus of contemporary Chinese that is word-segmented and annotated with a rich tag set, building on resources it already owns, e.g. the Grammatical Knowledge-base of Contemporary Chinese. The tag set contains about 40 tags. It covers the common part-of-speech tags, special-usage tags for verbs and adjectives, and tags for proper nouns, phrase-type place names, phrase-type organization names, and so on.
The scale of the corpus is about 27 million Chinese characters. The Institute of Computational Linguistics of PKU has completed the processing of 14 million characters, and the processing quality is very high.
A complete corpus-processing guideline is necessary to obtain a high-quality tagged corpus. This paper introduces the principles followed in drawing up the guideline and the experience gained in carrying it out.