A New Statistical Method of Automatic Lexicon Augmentation (一种新的基于统计的词典扩展方法)

ZHOU Zheng-yu, LI Zong-ge (周正宇, 李宗葛)

Journal of Chinese Information Processing (中文信息学报), 2001, Vol. 15, Issue 5: 47-52.



Abstract

Building a statistical language model often runs into the problem of a lexicon whose vocabulary is too small, and this out-of-vocabulary problem is one of the bottlenecks in Chinese language modeling. It is especially serious for corpora from specialized domains such as medicine. To address it, this paper proposes a new statistical method for identifying new words, the right boundary expanding method. The method applies association norm estimation to the runs of consecutive single-character words that appear after the corpus is segmented, and determines word boundaries by right boundary expanding. In the experiments, the right boundary expanding method is combined with a pairwise word-merging method based on Witten-Bell back-off, so that the lexicon, the segmentation, and the language model are optimized iteratively. The experimental results show that the algorithm achieves a high recognition accuracy and a high detection rate, and that it effectively identifies new words in the corpus, especially technical terms.
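
The abstract gives only the outline of right boundary expanding, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes a corpus that is already segmented into tokens, collects runs of consecutive single-character tokens, and grows each candidate word to the right while a score stays above a threshold. The score used here, count(candidate) / count(prefix), is a simple stand-in and is not the paper's association norm; the function names, `threshold`, `min_count`, and `max_len` are all hypothetical.

```python
from collections import Counter


def single_char_runs(tokens):
    """Yield maximal runs (length >= 2) of consecutive single-character tokens."""
    run = []
    for tok in tokens:
        if len(tok) == 1:
            run.append(tok)
        else:
            if len(run) >= 2:
                yield run
            run = []
    if len(run) >= 2:
        yield run


def ngram_counts(runs, max_len):
    """Count every character n-gram (1 <= n <= max_len) inside the runs."""
    counts = Counter()
    for run in runs:
        for i in range(len(run)):
            for n in range(1, max_len + 1):
                if i + n <= len(run):
                    counts["".join(run[i:i + n])] += 1
    return counts


def association(counts, candidate):
    """Stand-in score (an assumption, not the paper's association norm):
    the fraction of occurrences of the prefix that continue into candidate."""
    prefix = candidate[:-1]
    return counts[candidate] / counts[prefix] if counts[prefix] else 0.0


def expand_right(run, start, counts, threshold, min_count, max_len):
    """Grow a candidate from `start`, moving its right boundary one character
    at a time while the longer string is frequent and strongly associated."""
    end = start + 1
    while end < len(run) and end - start < max_len:
        candidate = "".join(run[start:end + 1])
        if counts[candidate] < min_count or association(counts, candidate) < threshold:
            break
        end += 1
    return "".join(run[start:end])


def extract_new_words(tokens, threshold=0.8, min_count=2, max_len=4):
    """Return a Counter of multi-character candidates found in the runs."""
    runs = list(single_char_runs(tokens))
    counts = ngram_counts(runs, max_len)
    found = Counter()
    for run in runs:
        i = 0
        while i < len(run):
            word = expand_right(run, i, counts, threshold, min_count, max_len)
            if len(word) >= 2:
                found[word] += 1
            i += len(word)
    return found


# Toy segmented text in which a medical term was split into single characters.
tokens = ["患者", "心", "肌", "炎", "的", "治疗", "心", "肌", "炎", "和", "肺", "炎"]
candidates = extract_new_words(tokens)
```

In an iterative setup like the one the abstract describes, candidates of this kind would be added to the lexicon, the corpus re-segmented, and the language model re-estimated; that loop and the Witten-Bell-based merging step are not shown here.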

Key words

lexicon / association norm estimation / right boundary expanding / language model

Cite this article

ZHOU Zheng-yu, LI Zong-ge. A New Statistical Method of Automatic Lexicon Augmentation. Journal of Chinese Information Processing. 2001, 15(5): 47-52

References

[1] Tang Haijiang, Pascale Fung. A multi-path syllable to word decoder with language model optimization and automatic lexicon augmentation. 2000 International Symposium on Chinese Spoken Language Processing, Beijing, China, Oct 2000
[2] Chien Lee-Feng. PAT-tree-based adaptive keyphrase extraction for intelligent Chinese information retrieval. Information Processing and Management, 1999, 35: 501-521
[3] Yang Kae-Cherng, Ho Tai-Hsuan, Chien Lee-Feng, et al. Statistics-based segment pattern lexicon: a new direction for Chinese language modeling. IEEE 1998 International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, 1998, 169-172
[4] Gao Jianfeng, Wang Hai-Feng, Li Mingjing, et al. A unified approach to statistical language modeling for Chinese. IEEE 2000 International Conference on Acoustics, Speech and Signal Processing, 2000
[5] Wong Pak-Kwong, Chan Chorkin. Chinese word segmentation based on maximum matching and word binding force. The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, 1996, 200-203
[6] Witten I H, Bell T C. The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Trans. on Inform. Theory, 1991, 37(4): 1085-1094