Abstract
Word segmentation is a fundamental problem in Chinese language processing. Traditional methods have difficulty adapting to the wide range of specialized domains and the changing language phenomena found in Web text analysis. To address this, the paper takes unsupervised segmentation as its basic framework, uses the EM algorithm to train an n-multigram language model, and proposes a confidence-based active learning segmentation algorithm. The system relies mainly on a large amount of unlabeled data while actively selecting a small amount of the most valuable data for manual annotation. Experimental results show that the algorithm outperforms several related unsupervised word segmentation algorithms.
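The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a unigram multigram model (each word is drawn independently), a uniform initialization over all substrings up to `max_len`, and a confidence score defined as the posterior probability of the best segmentation. The function names (`train`, `segment_with_confidence`, `select_for_annotation`) are hypothetical, and plain probabilities are used instead of log probabilities, so the sketch only suits short strings.

```python
from collections import defaultdict

def forward(text, prob, max_len):
    # alpha[i]: total probability of all segmentations of text[:i]
    alpha = [0.0] * (len(text) + 1)
    alpha[0] = 1.0
    for i in range(1, len(text) + 1):
        for k in range(1, min(max_len, i) + 1):
            alpha[i] += alpha[i - k] * prob.get(text[i - k:i], 0.0)
    return alpha

def backward(text, prob, max_len):
    # beta[i]: total probability of all segmentations of text[i:]
    n = len(text)
    beta = [0.0] * (n + 1)
    beta[n] = 1.0
    for i in range(n - 1, -1, -1):
        for k in range(1, min(max_len, n - i) + 1):
            beta[i] += prob.get(text[i:i + k], 0.0) * beta[i + k]
    return beta

def em_step(corpus, prob, max_len):
    # E-step: expected word counts; M-step: renormalize them
    counts = defaultdict(float)
    for text in corpus:
        alpha = forward(text, prob, max_len)
        beta = backward(text, prob, max_len)
        total = alpha[len(text)]
        if total == 0.0:
            continue  # sentence cannot be generated by the current model
        for i in range(len(text)):
            for k in range(1, min(max_len, len(text) - i) + 1):
                word = text[i:i + k]
                p = prob.get(word, 0.0)
                if p > 0.0:
                    counts[word] += alpha[i] * p * beta[i + k] / total
    norm = sum(counts.values())
    return {w: c / norm for w, c in counts.items()}

def train(corpus, max_len=2, iterations=5):
    # Assumed initialization: uniform over every substring up to max_len
    vocab = {t[i:i + k] for t in corpus for i in range(len(t))
             for k in range(1, min(max_len, len(t) - i) + 1)}
    prob = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iterations):
        prob = em_step(corpus, prob, max_len)
    return prob

def segment_with_confidence(text, prob, max_len):
    # Viterbi search for the single most probable segmentation
    n = len(text)
    best = [0.0] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 1.0
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            score = best[i - k] * prob.get(text[i - k:i], 0.0)
            if score > best[i]:
                best[i], back[i] = score, k
    total = forward(text, prob, max_len)[n]
    if best[n] == 0.0 or total == 0.0:
        return list(text), 0.0  # fall back to single characters
    words, i = [], n
    while i > 0:
        words.append(text[i - back[i]:i])
        i -= back[i]
    # Assumed confidence: share of probability mass held by the best path
    return words[::-1], best[n] / total

def select_for_annotation(corpus, prob, max_len, budget):
    # Active learning step: pick the least confident sentences for labeling
    ranked = sorted(corpus,
                    key=lambda t: segment_with_confidence(t, prob, max_len)[1])
    return ranked[:budget]
```

A typical call sequence would train on unlabeled text with `train(corpus)`, then pass the result to `select_for_annotation` to obtain the sentences whose segmentation the model is least sure of. How the manually labeled sentences are folded back into the model is left out of the sketch, since the abstract does not specify it.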
Key words
computer application /
Chinese information processing /
word segmentation /
unsupervised machine learning /
active learning /
EM algorithm
Funding
Supported by the National Natural Science Foundation of China (60272088) and the National 863 Program of China (2002AA11401)