Automatic Identification of Phrases of “N+V” Structure and “N1+N2+V” Structure in Search Engine Query Logs

ZHAO Honggai, LV Xueqiang, SHI Shuicai, ZHENG Li

Journal of Chinese Information Processing ›› 2012, Vol. 26 ›› Issue (5): 20-26.
Survey

Abstract

Correctly identifying phrases in search engine query logs plays an important role in building phrase dictionaries for search engines and in improving search performance. This paper proposes a method that applies conditional random fields to automatically identify phrases of “N+V” structure and “N1+N2+V” structure in the Sogou query log corpus. The model's feature set comprises words, parts of speech, and word lengths. A candidate feature set is designed manually, and effective features are selected from it to form the feature templates used to train the conditional random field model for automatic phrase identification. Results of closed and open tests show that the model identifies both types of phrase effectively.

Key words

conditional random fields / search engine query logs / phrases of “N+V” structure / phrases of “N1+N2+V” structure / feature templates
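
The feature set named in the abstract (word, part of speech, word length) can be sketched as a per-token feature extractor of the kind a conditional random field toolkit consumes. What follows is a minimal illustration under stated assumptions, not the authors' implementation: the context window size, the `n`/`v` tag names, the B/I/O labeling scheme, the example query, and the function names `token_features` and `bio_labels` are all hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); tag names are assumed


def token_features(query: List[Token], i: int, window: int = 1) -> Dict[str, str]:
    """Collect word, POS, and word-length features around position i.

    The window size is an assumption; the paper selects its own templates
    from a manually designed candidate set.
    """
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(query):
            word, pos = query[j]
            feats[f"word[{offset}]"] = word
            feats[f"pos[{offset}]"] = pos
            feats[f"len[{offset}]"] = str(len(word))
        else:
            # Mark positions that fall outside the query boundary.
            feats[f"pad[{offset}]"] = "1"
    return feats


def bio_labels(length: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert half-open phrase spans [start, end) into B/I/O labels."""
    labels = ["O"] * length
    for start, end in spans:
        labels[start] = "B"
        for k in range(start + 1, end):
            labels[k] = "I"
    return labels


# Hypothetical segmented and POS-tagged query.
query = [("手机", "n"), ("维修", "v"), ("电话", "n")]
features = [token_features(query, i) for i in range(len(query))]
# Assume the first two tokens form an "N+V" phrase.
labels = bio_labels(len(query), [(0, 2)])
```

Each position yields one feature dictionary and one label, so `features` and `labels` form a single training sequence. A sequence labeler trained on many such pairs would then predict the phrase boundaries for unseen queries.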

Cite this article

ZHAO Honggai, LV Xueqiang, SHI Shuicai, ZHENG Li. Automatic Identification of Phrases of “N+V” Structure and “N1+N2+V” Structure in Search Engine Query Logs. Journal of Chinese Information Processing. 2012, 26(5): 20-26

Funding

National Social Science Fund of China (09CYY021)