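Chinese organization names are vast in number and are constantly being coined; the great majority are not listed in dictionaries, which causes difficulties for natural language processing. From a linguistic point of view, however, an organization name is a modifier-head compound proper noun and, at the same time, a relatively simple kind of modifier-head noun phrase, with its own structural regularities and morphological markers. Focusing on the names of colleges and universities and drawing on real corpus data from mainland China, Hong Kong, and Taiwan, this paper discusses the identification and analysis of organization names from both the linguistic and the computational perspective and summarizes the corresponding rules. Applying these rules to identify college and university names in a corpus of more than six million characters from the three regions yields a precision of 97.3% (a name counts as correct only when both its left and right boundaries are located correctly) and a recall of 96.9%. The rules can also be applied in other areas, such as intelligent pinyin-to-Chinese-character conversion and machine translation.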
Abstract
As important proper nouns, Chinese names of organizations and institutions play an indispensable role in language communication. Unfortunately, because they are practically unlimited in number, are constantly being created and abandoned, and are relatively long and complex, most of these names have not found their way into the Chinese dictionaries of computer systems. Linguistically, however, these proper nouns can be viewed as a special group of compound nouns and as a simple category of noun phrase, with their own formation rules and morphological markers. This paper presents a pioneering discussion of the analysis of Chinese names of organizations and institutions from the computational point of view. Useful linguistic rules have been drawn from the discussion and applied to the identification of names of organizations and institutions in the 6,000,000-character Mainland-Hongkong-Taiwan corpus of modern Chinese developed by Hong Kong Polytechnic University. Preliminary experiments show that both the precision and the recall for identifying names of colleges and universities are over 96%.
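The evaluation criterion stated above counts an identified name as correct only when both its left and right boundaries are located correctly. A minimal sketch of how precision and recall could be computed under that exact-match criterion follows. The function name `precision_recall`, the representation of names as `(start, end)` character offsets, and the sample spans are illustrative assumptions, not the authors' implementation or data.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (start, end) spans.

    A predicted span is correct only if both of its boundaries
    coincide with those of a gold-standard span.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical spans, for illustration only.
gold = [(0, 4), (10, 16), (20, 26)]
predicted = [(0, 4), (10, 15), (20, 26), (30, 34)]

p, r = precision_recall(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f}")
```

In this made-up example the span `(10, 15)` has the wrong right boundary, so it counts against both precision and recall even though it overlaps a gold name.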
Key words
Organization and institution names / Proper nouns / Phrase analysis / Natural language processing