Abstract: Research on information extraction is moving toward open information extraction, i.e., extracting open categories of entities, relations, and events from open-domain text resources. The methods are likewise shifting from purely statistical machine learning models trained on human-annotated corpora to statistical learning models that incorporate knowledge bases mined from large-scale, heterogeneous Web resources. This paper first reviews the history of information extraction research, and then introduces in detail the task definitions, difficulties, typical methods, evaluations, performance, and challenges of three main open-domain information extraction tasks: entity extraction, entity disambiguation, and relation extraction. Finally, drawing on our own research in this field, we analyze and discuss the development directions of open information extraction and its applications in large-scale knowledge engineering, question answering, and other areas.
Key words: open information extraction; knowledge engineering; text understanding