WANG Wei, ZHONG Yi-xin, SUN Jian, YANG Li. A Self-organized Scheme for Word Segmentation Ambiguity Resolution Based on EM Training Algorithm[J]. Journal of Chinese Information Processing, 2001, 15(2): 39-45.
A Self-organized Scheme for Word Segmentation Ambiguity Resolution Based on EM Training Algorithm
WANG Wei, ZHONG Yi-xin, SUN Jian, YANG Li
Research Center of Intelligence of Beijing University of Posts and Telecommunications
Abstract: This paper presents a word segmentation ambiguity resolution scheme based on unsupervised training. Following the idea of EM, a language model is built incrementally by collecting the fractional counts of patterns (such as bigram pairs) from the augmentations of all the segmentation candidates of a sentence. The learned language model is then incorporated into a statistical segmentor. Experiments show that this scheme resolves 85.36% of the ambiguities on a test set in which every sentence contains at least one ambiguous part (the accuracy rate is measured per sentence).
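The fractional-count idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the exhaustive enumeration of candidates, the `max_len` limit, and the add-alpha smoothing are all assumptions made for illustration. Each segmentation candidate of a sentence is scored with the current bigram model, the scores are normalized into posterior weights, and every bigram in a candidate receives that candidate's weight as a fractional count.

```python
from collections import defaultdict


def segmentations(sentence, lexicon, max_len=4):
    """Enumerate every split of `sentence` into lexicon words.

    Single characters are always allowed, so at least one candidate exists.
    """
    if not sentence:
        return [[]]
    results = []
    for end in range(1, min(max_len, len(sentence)) + 1):
        word = sentence[:end]
        if end == 1 or word in lexicon:
            for rest in segmentations(sentence[end:], lexicon, max_len):
                results.append([word] + rest)
    return results


def bigrams(words):
    """Bigram pairs of a candidate, padded with sentence boundary markers."""
    padded = ["<s>"] + words + ["</s>"]
    return list(zip(padded, padded[1:]))


def candidate_score(words, counts, totals, vocab_size, alpha=1.0):
    """Product of add-alpha smoothed bigram probabilities (assumed smoothing).

    `vocab_size` is only an approximation of the true vocabulary size.
    """
    score = 1.0
    for prev, cur in bigrams(words):
        numerator = counts.get((prev, cur), 0.0) + alpha
        denominator = totals.get(prev, 0.0) + alpha * vocab_size
        score *= numerator / denominator
    return score


def em_iteration(sentences, lexicon, counts, totals):
    """One EM pass: weight candidates (E-step), collect fractional counts (M-step)."""
    vocab_size = len(lexicon) + 2  # lexicon words plus the two boundary markers
    new_counts = defaultdict(float)
    new_totals = defaultdict(float)
    for sentence in sentences:
        cands = segmentations(sentence, lexicon)
        scores = [candidate_score(c, counts, totals, vocab_size) for c in cands]
        norm = sum(scores)
        for cand, score in zip(cands, scores):
            weight = score / norm  # posterior weight of this candidate
            for prev, cur in bigrams(cand):
                new_counts[(prev, cur)] += weight
                new_totals[prev] += weight
    return new_counts, new_totals


def best_segmentation(sentence, lexicon, counts, totals):
    """Pick the highest-scoring candidate under the learned bigram counts."""
    vocab_size = len(lexicon) + 2
    return max(
        segmentations(sentence, lexicon),
        key=lambda c: candidate_score(c, counts, totals, vocab_size),
    )
```

In this sketch, training would start from empty counts (so all candidates are weighted by the smoothed prior alone) and call `em_iteration` repeatedly on unsegmented text, feeding the returned counts back in; `best_segmentation` then plays the role of the statistical segmentor. Exhaustive enumeration is exponential in sentence length, so a practical system would more likely use a lattice with forward-backward style computation.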