A Self-organized Scheme for Word Segmentation Ambiguity Resolution Based on EM Training Algorithm

WANG Wei, ZHONG Yi-xin, SUN Jian, YANG Li

Journal of Chinese Information Processing, 2001, Vol. 15, Issue (2): 39-45.
Abstract

This paper presents a scheme for resolving word segmentation ambiguity based on unsupervised training, together with a segmentation algorithm. Following the idea of EM, all segmentation candidates of each sentence (or those within a certain range) form the training set. From this training set and an initial language model, a new language model is estimated by collecting the fractional counts of patterns (such as bigram pairs) over the segmentation candidates. The final language model is obtained after several such iterations and is incorporated into a statistical segmenter. On a test set in which every sentence contains at least one ambiguity, this segmenter reaches a correct-segmentation accuracy of 85.36%, measured at the sentence level.
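The training loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a word unigram model instead of the bigram patterns mentioned above, enumerates all candidates exhaustively, and uses hypothetical names (`candidates`, `em_step`, `train`, `segment`) and a hypothetical probability floor for unseen words.

```python
from collections import defaultdict
from math import prod


def candidates(sentence, lexicon, max_len=4):
    """Enumerate every segmentation of `sentence` into lexicon words.

    Assumption: single characters are always allowed as words, so that
    every sentence has at least one candidate.
    """
    if not sentence:
        return [[]]
    results = []
    for i in range(1, min(max_len, len(sentence)) + 1):
        word = sentence[:i]
        if i == 1 or word in lexicon:
            for rest in candidates(sentence[i:], lexicon, max_len):
                results.append([word] + rest)
    return results


def score(segmentation, probs, floor=1e-6):
    """Unigram probability of one candidate (floor is an assumed fallback)."""
    return prod(probs.get(word, floor) for word in segmentation)


def em_step(corpus, lexicon, probs):
    """One EM iteration: fractional counts (E-step), then renormalize (M-step)."""
    counts = defaultdict(float)
    for sentence in corpus:
        cands = candidates(sentence, lexicon)
        scores = [score(seg, probs) for seg in cands]
        total = sum(scores)
        for seg, s in zip(cands, scores):
            weight = s / total  # posterior of this candidate under the current model
            for word in seg:
                counts[word] += weight  # fractional count
    norm = sum(counts.values())
    return {word: count / norm for word, count in counts.items()}


def train(corpus, lexicon, iterations=5):
    """Start from a uniform model and re-estimate it for a few iterations."""
    vocab = set(lexicon) | {ch for sentence in corpus for ch in sentence}
    probs = {word: 1.0 / len(vocab) for word in vocab}
    for _ in range(iterations):
        probs = em_step(corpus, lexicon, probs)
    return probs


def segment(sentence, lexicon, probs):
    """Pick the candidate with the highest probability under the learned model."""
    return max(candidates(sentence, lexicon), key=lambda seg: score(seg, probs))


if __name__ == "__main__":
    toy_corpus = ["abc", "ab", "ab", "c"]  # toy unsegmented "sentences"
    toy_lexicon = {"ab", "bc"}             # toy multi-character words
    model = train(toy_corpus, toy_lexicon)
    print(segment("abc", toy_lexicon, model))
```

Exhaustive enumeration grows exponentially with sentence length, which is presumably why the abstract allows restricting the candidates to a certain range; a practical version would use dynamic programming over a word lattice instead.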

Key words

EM algorithm / segmentation ambiguity / unsupervised learning

Cite this article

WANG Wei, ZHONG Yi-xin, SUN Jian, YANG Li. A Self-organized Scheme for Word Segmentation Ambiguity Resolution Based on EM Training Algorithm. Journal of Chinese Information Processing. 2001, 15(2): 39-45

Funding

Supported by the National Natural Science Foundation of China (6998201)