WANG Yue, LV Xueqiang, LI Zhuo, SHU Yan. Automatic Identification of Chinese Names in Search Logs[J]. Journal of Chinese Information Processing, 2015, 29(3): 162-168.
Automatic Identification of Chinese Names in Search Logs
WANG Yue¹, LV Xueqiang¹, LI Zhuo¹, SHU Yan²
1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; 2. Beijing TRS Information Technology Co., Ltd, Beijing 100101, China
Abstract: Recognizing person names in search logs is a focus of log mining and directly affects the retrieval efficiency and accuracy of search engines. This paper analyzes the drawbacks of applying name identification methods designed for long texts to search logs, and proposes a method for identifying Chinese names in search logs. The method uses Conditional Random Fields to extract name-internal word probabilities from search query logs, then estimates the credibility of each candidate person name according to the characteristics of the search log. Experimental results on Sogou query logs show that the approach reaches 81.97% accuracy and 85.81% recall on average, yielding an F-measure of 83.79%.
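As a quick check on how the reported figures relate, the balanced F-measure can be computed as the harmonic mean of precision and recall. The sketch below assumes that the "accuracy" in the abstract denotes precision; the function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Averages reported in the abstract, in percent.
# Assumption: the reported "accuracy" is read as precision.
f_from_averages = f_measure(81.97, 85.81)
print(f"{f_from_averages:.2f}")
```

Under that assumption the averaged figures give about 83.85%, slightly above the reported 83.79%. One plausible reading is that the paper averages per-test-set F-measures rather than computing F from the averaged precision and recall, but the abstract does not say.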