Abstract: Although cache-based language models adapt better to cross-domain settings, the hypothesis they rely on is too simple. It assumes that a word that has already appeared in an article is likely to reappear later in the same article, but it ignores both the influence of stop words and the interaction between different words. To address this problem, we make two improvements to the model. First, we use a TF-IDF scheme instead of simple count statistics. Second, we adopt an extended cache-based 2-gram model, which expands the information that the model exploits. Experiments show that the performance of the adaptive model improves greatly.
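As an illustration of the first improvement, the following is a minimal sketch of a TF-IDF weighted cache interpolated with a static model. It is not the implementation evaluated in the paper. The function names (`idf_table`, `cache_probs`, `adapted_prob`), the smoothed IDF formula, the linear interpolation and the weight `lam` are all assumptions made for this example. The extended 2-gram cache is not covered here.

```python
import math
from collections import Counter


def idf_table(documents):
    """Smoothed IDF from tokenized background documents (assumed formula)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    # A word that occurs in every document (a typical stop word) gets IDF 0.
    return {w: math.log((1 + n_docs) / (1 + df)) for w, df in doc_freq.items()}


def cache_probs(history, idf):
    """TF-IDF weighted distribution over the words seen so far in the article."""
    tf = Counter(history)
    # Assumption: words missing from the IDF table get the largest known IDF.
    default_idf = max(idf.values(), default=0.0)
    weights = {w: c * idf.get(w, default_idf) for w, c in tf.items()}
    total = sum(weights.values())
    if total == 0:
        return {}
    return {w: v / total for w, v in weights.items()}


def adapted_prob(word, history, static_prob, idf, lam=0.8):
    """Linearly interpolate a static model with the TF-IDF cache (assumed form)."""
    cache = cache_probs(history, idf)
    return lam * static_prob(word) + (1 - lam) * cache.get(word, 0.0)
```

With a plain frequency cache, a stop word such as "the" would dominate the cache distribution. Under the TF-IDF weighting sketched here, a word that occurs in every background document contributes nothing to the cache, so the adaptation concentrates on content words.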