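Increasing Accuracy of Chinese Segmentation with Strategy of Multi-step Processing

ZHAO Tie-jun, LV Ya-juan, YU Hao, YANG Mu-yun, LIU Fang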

Journal of Chinese Information Processing, 2001, Vol. 15, Issue 1: 13-18.
Review


Abstract

Automatic word segmentation of Chinese remains difficult when applied to large-scale real-world text. The two key problems are the identification of unknown words and the resolution of segmentation ambiguities. This paper describes a multi-step processing strategy intended to reduce the difficulty of segmentation and improve its accuracy. The procedure consists of seven parts: elimination of pseudo-ambiguities, full segmentation of the sentence, deterministic segmentation of some words, numeral string processing, reduplicated word processing, statistical identification of unknown words, and a final step, integrated with the part-of-speech tagger, that uses part-of-speech information to resolve the remaining segmentation ambiguities. In open tests, segmentation accuracy is above 98%.
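Two of the steps named in the abstract, full segmentation of a sentence and numeral string processing, can be illustrated with a minimal sketch. The abstract does not give the authors' algorithms, so everything here is an assumption made for illustration: the toy lexicon, the function names `full_segmentation` and `merge_numeral_strings`, the `max_word_len` limit, and the set of numeral characters are all hypothetical and are not taken from the paper.

```python
from functools import lru_cache

# Hypothetical toy lexicon, for illustration only (not from the paper).
LEXICON = {"研究", "研究生", "生命", "起源"}

# Assumed set of characters treated as numerals (illustrative, not exhaustive).
NUMERAL_CHARS = set("0123456789〇零一二两三四五六七八九十百千万亿")


def full_segmentation(sentence, lexicon, max_word_len=4):
    """Enumerate every way to split `sentence` into lexicon words.

    Single characters are always allowed as a fallback, so every
    sentence has at least one segmentation.
    """

    @lru_cache(maxsize=None)
    def split_from(start):
        # Base case: nothing left to split, one empty segmentation.
        if start == len(sentence):
            return [[]]
        results = []
        last_end = min(len(sentence), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = sentence[start:end]
            if end - start == 1 or word in lexicon:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


def merge_numeral_strings(tokens):
    """Merge runs of adjacent numeral tokens into a single token."""
    merged = []
    run = ""
    for tok in tokens:
        if tok and all(ch in NUMERAL_CHARS for ch in tok):
            run += tok  # extend the current numeral run
        else:
            if run:
                merged.append(run)
                run = ""
            merged.append(tok)
    if run:
        merged.append(run)
    return merged


if __name__ == "__main__":
    for seg in full_segmentation("研究生命起源", LEXICON):
        print(" / ".join(seg))
    print(merge_numeral_strings(["一", "九", "九", "九", "年"]))
```

With this toy lexicon, the candidates for "研究生命起源" include both "研究 / 生命 / 起源" and "研究生 / 命 / 起源". Choosing between such candidates is the ambiguity problem that the later steps of the strategy address. Enumerating all candidates grows quickly with sentence length, so a practical system would likely store them in a word lattice rather than as explicit lists.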

Key words

Chinese segmentation / ambiguity / multi-step strategy

Cite this article

ZHAO Tie-jun, LV Ya-juan, YU Hao, YANG Mu-yun, LIU Fang. Increasing Accuracy of Chinese Segmentation with Strategy of Multi-step Processing. Journal of Chinese Information Processing. 2001, 15(1): 13-18


Funding

National 863 Program (863-306-ZT03-06-3/863-306-ZD13-04-4); National Natural Science Foundation of China (69775017)