Abstract
This paper presents an unsupervised method for acquiring verb semantic knowledge from a small corpus and a machine-readable dictionary (MRD). The method does not depend on a sense-tagged corpus. Instead, it learns the typical usages given in the MRD usage examples for each sense of a polysemous verb, and combines them with verb-object (V+N) co-occurrences acquired from the corpus. Data sparseness is addressed in two ways. First, word similarity is extended from direct co-occurrence to co-occurrence of co-occurring words, so that similarity is computed over co-occurrence clusters rather than over individual co-occurring words. Second, IS-A relations between nouns are acquired from the MRD definitions, which makes it possible to cluster the nouns roughly. With these two techniques, two words may be judged similar even if they share no co-occurring word. Experiments show that the method can learn from a very small training corpus and achieves a disambiguation accuracy of 85.7% without restricting the senses of a word.
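The two sparseness remedies described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy V+N pairs, the verb clusters in `VERB_CLUSTER`, the IS-A links in `IS_A`, and the use of Jaccard overlap as the similarity measure are hypothetical and are not taken from the paper, which may use a different measure and learned clusters.

```python
from collections import defaultdict

# Toy verb-object (V+N) pairs; invented for illustration, not the paper's corpus.
PAIRS = [("drink", "tea"), ("sip", "coffee"),
         ("read", "novel"), ("browse", "magazine")]

# Hypothetical clusters of co-occurring verbs (assumed given, not learned here).
VERB_CLUSTER = {"drink": "INGEST", "sip": "INGEST",
                "read": "PERUSE", "browse": "PERUSE"}

# Hypothetical IS-A links of the kind that might be read off MRD definitions.
IS_A = {"tea": "beverage", "coffee": "beverage",
        "novel": "publication", "magazine": "publication"}


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for the paper's measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_profiles(pairs):
    """Map each noun to its co-occurring verbs and to those verbs' clusters."""
    words, clusters = defaultdict(set), defaultdict(set)
    for verb, noun in pairs:
        words[noun].add(verb)
        # Verbs without a known cluster fall back to themselves.
        clusters[noun].add(VERB_CLUSTER.get(verb, verb))
    return dict(words), dict(clusters)


def similarity(n1, n2, profiles):
    """Jaccard similarity of two nouns under the given profile table."""
    return jaccard(profiles.get(n1, set()), profiles.get(n2, set()))


def same_hypernym(n1, n2):
    """True when both nouns have the same recorded IS-A parent."""
    return n1 in IS_A and IS_A.get(n1) == IS_A.get(n2)


WORDS, CLUSTERS = build_profiles(PAIRS)
direct = similarity("tea", "coffee", WORDS)         # no shared verb
clustered = similarity("tea", "coffee", CLUSTERS)   # shared verb cluster
```

In this toy data "tea" and "coffee" never occur with the same verb, so the direct measure finds no overlap, while the cluster-level profiles and the shared IS-A parent both relate them. That is the effect the abstract describes: two words can be judged similar without sharing any co-occurring word.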
Key words
artificial intelligence /
natural language processing /
machine-readable dictionary (MRD) /
dual distribution /
semantics /
knowledge acquisition
Funding
Shanxi Province Youth Fund project (20001017)