Abstract
Out-of-vocabulary words are one of the bottlenecks in statistical Chinese language modeling: the lexicon often fails to cover the vocabulary of the training corpus. The problem is especially serious for domain-specific corpora such as medical text. This paper presents a new statistical method for extracting new words from training data, the right boundary expanding method. The method applies association norm estimation to the runs of consecutive single-character words that segmentation produces, and locates word boundaries by expanding the right boundary. In our experiments we combine right boundary expanding with a pairwise word-merging method based on Witten-Bell back-off, and iteratively adjust the lexicon, the segmentation, and the language model. The results show that the algorithm achieves high precision and recall and effectively identifies new words in the corpus, especially technical terms.
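The abstract does not give the association norm formula, so the following is only a rough sketch of the general idea of expanding a right boundary over runs of single-character words. It assumes pointwise mutual information between adjacent characters as a stand-in association score; the function names, `threshold`, and `min_count` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def single_char_runs(tokens):
    """Yield maximal runs (length >= 2) of consecutive single-character tokens."""
    run = []
    for tok in tokens:
        if len(tok) == 1:
            run.append(tok)
        else:
            if len(run) >= 2:
                yield run
            run = []
    if len(run) >= 2:
        yield run


def extract_candidates(sentences, threshold=3.0, min_count=2):
    """Return a Counter of new-word candidates found in segmented sentences.

    ASSUMPTION: association is scored with pointwise mutual information
    (PMI) over adjacent characters; the paper's own association norm may differ.
    """
    unigrams, bigrams, runs = Counter(), Counter(), []
    for tokens in sentences:
        for run in single_char_runs(tokens):
            runs.append(run)
            unigrams.update(run)
            bigrams.update(zip(run, run[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    def assoc(a, b):
        # Rare pairs are treated as unassociated to avoid unreliable scores.
        n_ab = bigrams[(a, b)]
        if n_ab < min_count:
            return float("-inf")
        p_ab = n_ab / total_bi
        p_a = unigrams[a] / total_uni
        p_b = unigrams[b] / total_uni
        return math.log2(p_ab / (p_a * p_b))

    candidates = Counter()
    for run in runs:
        i = 0
        while i < len(run) - 1:
            j = i + 1
            # Push the right boundary outward while the next character
            # is still strongly associated with the current last one.
            while j < len(run) and assoc(run[j - 1], run[j]) >= threshold:
                j += 1
            if j - i >= 2:
                candidates["".join(run[i:j])] += 1
            i = j  # continue after the boundary just found
    return candidates
```

In a loop like the one the abstract describes, candidates returned here would be added to the lexicon, the corpus re-segmented, and the language model re-estimated before the next pass.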
Key words
lexicon /
association norm estimation /
right boundary expanding /
language model