Document Similarity Model Based on Mixture Language Model

LI Xiao-guang, YU Ge, WANG Da-ling

Journal of Chinese Information Processing ›› 2006, Vol. 20 ›› Issue (4): 43-50.

Abstract

Existing document similarity models fit document characteristics incompletely and lack a theoretical foundation. To overcome these weaknesses, this paper builds on statistical language models and proposes a document similarity model based on the Mixture Language Model (MLM). MLM describes document characteristics with a statistical language model and treats each relevant influencing factor as a latent sub-model; the document language model is a mixture of these sub-models, so it reflects document characteristics accurately and comprehensively. Because MLM selects the relevant factors according to the specific application and builds the document model from them, it is highly flexible and extensible. Within the MLM framework, the paper presents an instance based on the similarity of document topic content. Experiments on the TREC9 dataset show that MLM outperforms the vector space model (VSM).
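The abstract does not give the model's formulas, so the following is only a minimal sketch of the general idea: a document language model formed as a mixture of sub-models, with the mixing weights estimated by EM. Everything specific here is an assumption made for illustration and is not taken from the paper: the two toy topic sub-models, keeping the sub-models fixed while EM fits only the weights, the additive smoothing constant `alpha`, and the use of Jensen-Shannon divergence as the similarity measure.

```python
import math
from collections import Counter

def unigram(tokens, vocab, alpha=0.01):
    """Additively smoothed unigram distribution over vocab (alpha is an assumed constant)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def em_mixture_weights(tokens, components, iters=50):
    """Estimate mixing weights of fixed component models for one document by EM."""
    k = len(components)
    weights = [1.0 / k] * k
    counts = Counter(tokens)
    n = len(tokens)
    for _ in range(iters):
        expected = [0.0] * k
        for w, c in counts.items():
            joint = [weights[i] * components[i][w] for i in range(k)]
            z = sum(joint)
            for i in range(k):
                # E-step: posterior probability that component i generated word w
                expected[i] += c * joint[i] / z
        # M-step: new weight is the expected share of tokens per component
        weights = [e / n for e in expected]
    return weights

def mixture_model(weights, components, vocab):
    """Document language model: weighted sum of the component distributions."""
    return {w: sum(l * comp[w] for l, comp in zip(weights, components)) for w in vocab}

def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base 2); lies in [0, 1]."""
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))

# Hypothetical toy data: two topic sub-models and three short documents.
sports = "team match goal player wins scored".split()
finance = "bank interest rate market raises loan".split()
d1 = "team wins match goal".split()
d2 = "goal scored team match".split()
d3 = "bank raises interest rate".split()

vocab = sorted(set(sports + finance))
components = [unigram(sports, vocab), unigram(finance, vocab)]

def document_model(tokens):
    return mixture_model(em_mixture_weights(tokens, components), components, vocab)

sim_12 = js_similarity(document_model(d1), document_model(d2))
sim_13 = js_similarity(document_model(d1), document_model(d3))
print(sim_12 > sim_13)
```

In this toy setup the two sports documents receive nearly identical mixing weights, so their mixture models are closer to each other than to the finance document. A real application would choose the sub-models according to the factors that matter for the task, as the abstract describes.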

Key words

artificial intelligence / natural language processing / document similarity / statistical language model / finite mixture model / EM algorithm

Cite this article

LI Xiao-guang, YU Ge, WANG Da-ling. Document Similarity Model Based on Mixture Language Model. Journal of Chinese Information Processing. 2006, 20(4): 43-50


Funding

Supported by the National Natural Science Foundation of China (60573090; 60503036; 60473073)