Named Entity Mining from User Query Logs Based on a Semi-supervised Topic Model

CAO Lei1,2, GUO Jiafeng1, BAI Lu1,2, CHENG Xueqi1

Journal of Chinese Information Processing, 2012, Vol. 26, Issue 5: 26-33.
Survey

Named Entity Mining from Query Log through Semi-supervised Topic Modeling

  • CAO Lei1,2, GUO Jiafeng1, BAI Lu1,2, CHENG Xueqi1

Abstract

Named entity mining from query logs aims to extract named entities of a specified class from user query logs. Previous work proposed a seed-based method that ranks candidate entities by the similarity between the template distribution of the specified class and that of each candidate. However, that method does not take into account the ambiguity of named entities, the polysemy of query templates, or the information in unlabeled entities, so it cannot rank candidates effectively. In this paper, we propose a semi-supervised topic model that leverages the relationships between templates (i.e., their co-occurrence) to learn the template distribution of the specified class and thereby improve entity ranking. Experimental results show the effectiveness of the proposed method.
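
To make the seed-based baseline mentioned in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation and not the proposed topic model. It assumes that a template is a query with the entity mention replaced by `#`, that the class distribution is the average of the seed entities' template distributions, and that similarity is measured by cosine; the abstract specifies none of these details. The function names and the toy query log are hypothetical.

```python
from collections import Counter
from math import sqrt


def templates_of(entity, queries):
    """Return a normalized template distribution for one entity.

    Assumption: a template is a query with the entity mention replaced by '#'.
    """
    counts = Counter()
    for q in queries:
        padded = f" {q} "
        if f" {entity} " in padded:
            template = padded.replace(f" {entity} ", " # ", 1).strip()
            if template != "#":  # a bare entity query carries no context
                counts[template] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def class_distribution(seeds, queries):
    """Average the seeds' template distributions (an assumed aggregation)."""
    agg = Counter()
    for s in seeds:
        for t, p in templates_of(s, queries).items():
            agg[t] += p / len(seeds)
    return dict(agg)


def cosine(p, q):
    """Cosine similarity between two sparse distributions stored as dicts."""
    dot = sum(v * q.get(t, 0.0) for t, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def rank_candidates(seeds, candidates, queries):
    """Sort candidates by similarity to the class template distribution."""
    cls = class_distribution(seeds, queries)
    scored = [(c, cosine(cls, templates_of(c, queries))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy query log; the seed entities stand for a "movie" class.
QUERIES = [
    "titanic trailer", "titanic cast", "avatar trailer", "avatar cast",
    "inception trailer", "inception cast", "paris hotels", "paris weather",
    "harry potter trailer", "harry potter book review",
]
RANKING = rank_candidates(
    ["titanic", "avatar"], ["inception", "paris", "harry potter"], QUERIES
)
```

In this toy log, "harry potter" ranks below "inception" because its templates are split between film and book contexts. That is the kind of entity ambiguity the abstract says a plain similarity ranking does not model.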

Key words

query log / named entity mining / semi-supervised topic model

Cite this article

CAO Lei, GUO Jiafeng, BAI Lu, CHENG Xueqi. Named Entity Mining from Query Log through Semi-supervised Topic Modeling. Journal of Chinese Information Processing. 2012, 26(5): 26-33


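Funding

Supported by the National Natural Science Foundation of China (60903139, 60873243, 60933005) and by key projects of the National High Technology Research and Development Program of China (863 Program) (2010AA012502, 2010AA012503)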