1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; 2. Tsinghua National Laboratory for Information Science and Technology (TNList), Center for Speech and Language Technologies, Research Institute of Information Technology, Tsinghua University, Beijing 100084, China
Abstract: Corpus resources are closely tied to natural language processing. However, different research institutions follow different rules and tag sets when constructing their corpora, which hinders the construction of a large unified corpus. This paper investigates the differing annotation schemes and presents a method for integrating heterogeneous corpora. Experiments on part-of-speech mapping and disambiguation indicate an accuracy of 87% after integration, demonstrating the validity of the method.
Key words: corpus construction; data fusion; word mapping; POS disambiguation
Received: 2015-10-08
Finalized: 2016-05-25
Funding: National Natural Science Foundation of China (61271304, 61671070); Beijing Advanced Innovation Center for Imaging Technology (BAICIT-2016003); National Social Science Foundation of China (14@ZH036)