Abstract
This paper presents a multi-information clustering algorithm for adjectives, based on the Machine Tractable Dictionary of Contemporary Chinese Predicate Adjectives and a balanced corpus. For each adjective, the system extracts three kinds of information from the corpus (the nouns it modifies, its synonyms and near-synonyms, and its antonyms), so that the adjectives are linked into a network. We describe how each kind of information is modeled separately to compute the similarity between adjectives, and how the similarity of each adjective pair is then corrected with two auxiliary measures, literal similarity and the route weight between the two adjectives, which alleviates the data sparseness problem to some extent. We also show how a clustering algorithm uses these similarities to produce groups of adjectives, and we present results for a sample set of adjectives; the experiments indicate that the algorithm is effective.
Key words
artificial intelligence /
machine translation /
machine learning /
word clustering /
compositional pairs /
Kendall’s τ coefficient /
literal similarity /
route weight
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60573188)