Abstract:Cross Document Coreference(CDC) resolution is an important step in information integration and information fusion. As a consequence, a CDC corpus is indispensable for research and evaluation of CDC resolution. Given the fact that no Chinese CDC corpus is publicly available oriented for information extraction, this paper describes how to build a CDC corpus based on the ACE2005 Chinese corpus via automatic generation and manual annotation, which covers all the ACE entity types. The corpus is made publicly available to advance the research on Chinese CDC resolution. In addition, this paper analyses the types and characteristics of CDC in Chinese text as well as proposes the concept of two metrics, i.e., “variation perplexity” and “ambiguity perplexity”, to evaluate the difficulty of Chinese CDC resolution, providing some insights for further CDC research.
[1] Dagan I, Itai A. Automatic Processing of Large Corpora for the Resolution of Anaphora References[C]//Proceedings of the 13th conference on Computational linguistics. Stroudsburg, PA, USA: Hans Karlgren, 1990: 330-332. [2] 王厚峰.汉语指代消解与省略恢复研究[D]. 中国科学院声学研究所2000届博士后出站报告. [3] Mayfield J, Alexander D, Dorr B, et al. Cross-Document Coreference Resolution: A Key Technology for Learning by Reading[C]//Proceedings of the AAAI 2009 Spring Symposium on Learning by Reading and Learning to Read. Stanford, California: March 23, 2009: 65-70. [4] Grishman R, Beth S. Message Understanding Conference-6: a brief history[C]//Proceedings of the 16th Conference on Computational Linguistics (COLING'96). Copenhagen, Denmark: August, 1996: 05-09. [5] NIST Speech Group. The ACE 2008 evaluation plan: Assessment of Detection and Recognition of Entities and Relations Within and Across Documents[EB/OL]. http://www.nist.gov/speech/tests/ace/2008/doc/ace08-evalplan.v1.2d.pdf, 2008. [6] Bagga A, Baldwin B. Entity-Based Cross-Document Coreferencing Using the Vector Space Model[C]//Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL'98). Montréal, Québec, Canada: 1998: 79-85. [7] Gooi C H, Allan J. Cross-Document Coreference on a Large Scale Corpus[C]//Proceedings of HLT-NAACL 2004. USA,2004: 9-16. [8] Batista-Navarro R T, Ananiadou S. Building a Coreference-Annotated Corpus from the Domain of Biochemistry[C]//Proceedings of the 2011 Workshop on Biomedical Natural Language Processing, ACL-HLT 2011. Portland, Oregon, USA: June 23-24, 2011: 83-91. [9] Artiles J, Gonzalo J, Sekine S. Web People Search Task at SemEval-2007[EB/OL]. http://nlp.uned.es/weps/weps2007_data_readme_1.1.txt, 2007 [10] CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2010)[EB/OL]. http://www.cipsc.org.cn/clp2010/task3_ch.htm, 2010. [11] Singh S, Subramanya A, Pereira F, et al. Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon: 2011: 793-803. [12] CLSP Summer Workshop. Exploiting Lexical & Encyclopedic Resources for Entity Disambiguation[EB/OL]. http://www.clsp.jhu.edu/ws2007/groups/elerfed/documents/ELERFED-CDC-Overview.v2.ppt, 2007. [13] Task Description for Knowledge Base Population at TAC 2009[EB/OL]. http://apl.jhu.edu/~paulmac/kbp/090601-KBPTaskGuidelines.pdf, 2009 [14] 赵军, 刘康, 周光有, 等. 开放式文本信息抽取[J]. 中文信息学报, 2011, 25(6): 98-110. [15] Rao D, McNamee P, Dredze M. Entity Linking: Finding Extracted Entities in a Knowledge Base, Multi-source, Multi-lingual Information Extraction and Summarization[M]. Germany, Springer, 2011. [16] Cucerzan S. Large-Scale Named Entity Disambiguation Based on Wikipedia Data[C]//Proceedings of Empirical Methods in Natural Language Processing. Prague: June 28-30, 2007: 708-716. [17] Passoneau R J. Computing reliability for coreference annotation[C]//Proceedings of the International Conference on Language Resouces (LREC). Lisbon, Portugal: May 2004. [18] Krippendorff K H. Content Analysis: An Introduction to Its Methodology[M]. Beverly Hills, CA: Sage Publications, 1980. [19] Popescu O. Person Cross Document Coreference with Name Perplexity Estimates[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: 6-7 August 2009: 997-1006. [20] Bagga A. Evaluation of Coreferences and Coreference Resolution Systems[C]//Proceedings of the First Language Resource and Evaluation Conference. Granada, Spain: May 1998. [21] Baron A, Freedman M. Who is Who and What is What: Experiments in Cross-Document Co-Reference[C]//Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Honolulu: October 2008: 274-283.