CRF Based Research on a Unified Approach to Word Segmentation and POS Tagging for Pre-Qin Chinese

SHI Min, LI Bin, CHEN Xiaohe

Journal of Chinese Information Processing, 2010, Vol. 24, Issue 2: 39-46.
Review


Abstract

This paper explores word segmentation and POS tagging for ancient Chinese, particularly pre-Qin documents. The text of "Zuo Zhuan" is first segmented and POS-tagged manually and then analyzed. The Conditional Random Fields (CRF) model is then adopted for comparative experiments on automatic word segmentation (WS), POS tagging (PT), and a unified process of WS and PT. The results show that unified segmentation achieves clearly higher precision and recall than stand-alone segmentation, with an F-score of 94.60% in the open test; unified POS tagging reaches an F-score of 89.65%, clearly higher than the traditional two-step approach of segmenting first and tagging afterwards. This research can serve the study of ancient Chinese vocabulary and corpus construction, and can compensate for the limitations of manual tagging.
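To make the idea of a unified process concrete, the sketch below shows one common way to cast joint segmentation and tagging as a single character-level sequence labeling task, together with the word-level precision, recall and F-score used to report results. This page does not give the authors' label scheme, feature templates or POS tagset, so the B/M/E/S position labels, the tag names ("n", "v", "p") and the function names encode, decode and prf are all illustrative assumptions rather than the paper's implementation. The CRF training step itself is omitted; a CRF toolkit would be trained on the (character, label) pairs that encode produces.

```python
from typing import List, Set, Tuple

Word = Tuple[str, str]  # (word, POS tag); the tagset here is hypothetical


def encode(words: List[Word]) -> List[Tuple[str, str]]:
    """Turn a segmented, tagged sentence into per-character joint labels.

    Each label is "<position>-<POS>", with position in B/M/E/S
    (begin, middle, end, single-character word). This scheme is an assumption.
    """
    out = []
    for word, pos in words:
        for i, ch in enumerate(word):
            if len(word) == 1:
                position = "S"
            elif i == 0:
                position = "B"
            elif i == len(word) - 1:
                position = "E"
            else:
                position = "M"
            out.append((ch, f"{position}-{pos}"))
    return out


def decode(labeled: List[Tuple[str, str]]) -> List[Word]:
    """Rebuild (word, POS) pairs from per-character joint labels."""
    words, buf, pos = [], "", ""
    for ch, label in labeled:
        position, pos = label.split("-", 1)
        buf += ch
        if position in ("S", "E"):  # a word ends on S or E
            words.append((buf, pos))
            buf = ""
    if buf:  # flush an unterminated word from an ill-formed label sequence
        words.append((buf, pos))
    return words


def spans(words: List[Word], with_pos: bool) -> Set[tuple]:
    """Character-offset spans of each word, optionally with its POS tag."""
    out, start = set(), 0
    for word, pos in words:
        end = start + len(word)
        out.add((start, end, pos) if with_pos else (start, end))
        start = end
    return out


def prf(gold: List[Word], pred: List[Word], with_pos: bool = False):
    """Word-level precision, recall and F-score.

    with_pos=False scores segmentation only; with_pos=True also requires
    the POS tag of each word to match.
    """
    g, p = spans(gold, with_pos), spans(pred, with_pos)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Opening sentence of the Zuo Zhuan narrative; segmentation and tags are illustrative.
gold = [("鄭伯", "n"), ("克", "v"), ("段", "n"), ("于", "p"), ("鄢", "n")]
# A hypothetical system output that wrongly splits the first word.
pred = [("鄭", "n"), ("伯", "n"), ("克", "v"), ("段", "n"), ("于", "p"), ("鄢", "n")]

if __name__ == "__main__":
    print(encode(gold))
    print(prf(gold, pred))
    print(prf(gold, pred, with_pos=True))
```

Because every character carries both a boundary decision and a POS tag in one label, a single model makes both decisions at once. This is the property that distinguishes the unified setup from the two-step pipeline, in which segmentation errors are passed on to the tagger unchanged.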

Key words

computer application / Chinese information processing / Pre-Qin Chinese / word segmentation / POS tagging / Zuo Zhuan / conditional random fields model

Cite this article

SHI Min, LI Bin, CHEN Xiaohe. CRF Based Research on a Unified Approach to Word Segmentation and POS Tagging for Pre-Qin Chinese. Journal of Chinese Information Processing. 2010, 24(2): 39-46


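Funding

Sub-project "Pre-Qin Literature Vocabulary Statistics and Knowledge Retrieval System" of the National "211 Project" Phase III Key Discipline Construction Project "Language Technology Innovation and Work Platform Construction"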