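With the rapid growth of the Internet, the number of spam pages produced by web spamming keeps increasing, seriously degrading the retrieval efficiency of search engines and the experience of their users. Combating spam has become one of the most important challenges that search engines face. However, most current anti-spam research relies on page content or link features, and no general, practical identification method exists. Based mainly on an analysis of the intentions behind spamming, this paper presents an alternative taxonomy of spam pages, which offers useful guidance for intention-based spam page identification.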
Abstract
Named entities are important meaningful units in texts. The recognition and analysis of named entities are of great significance in fields such as Web information extraction, Web content management, and knowledge engineering. Research on named entities includes named entity recognition, disambiguation, coreference resolution, attribute extraction, and relation detection. Focusing on named entity recognition, disambiguation, and cross-lingual coreference resolution, the paper gives a thorough survey of the state of the art of these tasks, including the challenges, methods, evaluations, performance, and open problems. The paper suggests that the performance of current systems for named entity recognition, disambiguation, and cross-lingual coreference resolution is far from meeting the requirements of large-scale practical applications. In terms of methods and approaches, named entity recognition, disambiguation, and cross-lingual coreference resolution should be carried beyond natural language texts and investigated directly on large-scale, redundant, heterogeneous, ill-formed, and noisy web pages.
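Keywords (Chinese)
computer application /
Chinese information processing /
named entity recognition /
named entity disambiguation /
cross-lingual named entity association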
Key words
computer application /
Chinese information processing /
Web spam /
intention analysis /
spam page taxonomy
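Funding
National 973 Key Basic Research Program of China (2004CB318108); National Natural Science Foundation of China (60621062, 60503064, 60736044); National 863 High-Tech Program of China (2006AA01Z141)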