Automatic Recognition of Terms in Chinese Patent Literature
YANG Shuanglong1, LV Xueqiang1, LI Zhuo1, XU Liping2
1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; 2. Beijing Research Center of Urban System Engineering, Beijing 100089, China
Abstract: Chinese patent literature contains abundant domain-specific terms, and automatic term recognition is an important task in information extraction and text mining. In this paper, we propose a method for automatically generating term formation rules, together with a novel TermRank algorithm. First, a set of term formation rules is generated automatically from a large number of patent titles. These rules are then applied to patent texts to extract term candidates. Finally, the TermRank algorithm selects the final terms from these candidates. Experimental results on 9725 Chinese patent documents demonstrate the effectiveness of the proposed approach.
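The first two stages described in the abstract (learning term formation rules from patent titles, then applying them to patent text) can be illustrated with a minimal sketch. The abstract does not specify the rule format, so this sketch assumes that a rule is simply a part-of-speech tag sequence observed in enough titles. The function names `learn_rules` and `find_candidates`, the `min_count` threshold, the `sep` parameter, and the toy tag set are all hypothetical and are not taken from the paper. The TermRank ranking stage is not sketched, because the abstract gives no details of it.

```python
from collections import Counter


def learn_rules(tagged_titles, min_count=2):
    """Assumed rule format: the POS-tag sequence of a whole title.

    Keeps every tag sequence that occurs in at least `min_count` titles.
    """
    counts = Counter(tuple(tag for _, tag in title) for title in tagged_titles)
    return {pattern for pattern, n in counts.items() if n >= min_count}


def find_candidates(tagged_sentence, rules, sep=""):
    """Return the word spans whose POS-tag sequence matches a learned rule.

    `sep` joins the words of a span; the empty default suits Chinese text.
    """
    words = [word for word, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    candidates = set()
    for pattern in rules:
        k = len(pattern)
        for i in range(len(tags) - k + 1):
            if tuple(tags[i:i + k]) == pattern:
                candidates.add(sep.join(words[i:i + k]))
    return sorted(candidates)


# Toy, hand-tagged data (hypothetical): "n" = noun, "v" = verb, "r" = other.
titles = [
    [("image", "n"), ("sensor", "n")],
    [("signal", "n"), ("processor", "n")],
    [("rotating", "v"), ("shaft", "n")],
]
sentence = [("the", "r"), ("image", "n"), ("sensor", "n"), ("rotates", "v")]

rules = learn_rules(titles)
candidates = find_candidates(sentence, rules, sep=" ")
```

In this toy run only the noun-noun pattern appears in at least two titles, so it is the only rule kept. A real system would obtain the tags from a Chinese word segmenter and POS tagger rather than from hand-labeled tokens.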