Chinese Lexical Semantic Similarity Computing Based on Large-scale Corpus
SHI Jing1, WU Yunfang1, QIU Likun2, LV Xueqiang3
1. Institute of Computational Linguistics, Peking University, Beijing 100871, China; 2. School of Chinese Language and Literature, Ludong University, Yantai, Shandong 264025, China; 3. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract: Automatic acquisition of similar words is one of the most crucial problems in natural language processing, with applications such as query expansion in information retrieval, pattern identification in machine translation, syntactic parsing, and word sense disambiguation (WSD). This paper studies Chinese lexical semantic similarity computation based on a large-scale corpus, comparing context feature weighting functions, vector similarity measures, window-based versus dependency-based contexts, and a newspaper corpus versus a web corpus. Our experiments show that the best semantic similarity results are obtained on the web corpus, using window-based contexts with pointwise mutual information (PMI) weighting and the cosine measure.
Key words: semantic similarity; context; weight function; dependency relation
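The best-performing configuration named in the abstract (window-based context, PMI weighting, cosine measure) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so the code assumes the standard definitions: PMI(w, c) = log2(P(w, c) / (P(w) P(c))) estimated from co-occurrence counts, and cosine similarity over sparse weight vectors. The window size of 2, the clipping of negative PMI values to zero, the toy pre-segmented corpus, and all function names are assumptions made for illustration, not details from the paper.

```python
import math
from collections import Counter, defaultdict


def window_contexts(sentences, window=2):
    """Count (word, context word) co-occurrences within +/- `window` tokens."""
    pair_counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[word][tokens[j]] += 1
    return pair_counts


def pmi_vectors(pair_counts):
    """Turn raw co-occurrence counts into sparse PMI-weighted vectors.

    Assumption: negative PMI values are dropped (positive PMI), a common
    choice that the abstract neither confirms nor rules out.
    """
    total = sum(sum(ctxs.values()) for ctxs in pair_counts.values())
    word_totals = {w: sum(ctxs.values()) for w, ctxs in pair_counts.items()}
    ctx_totals = Counter()
    for ctxs in pair_counts.values():
        ctx_totals.update(ctxs)

    vectors = {}
    for word, ctxs in pair_counts.items():
        vec = {}
        for ctx, n in ctxs.items():
            pmi = math.log2(n * total / (word_totals[word] * ctx_totals[ctx]))
            if pmi > 0:
                vec[ctx] = pmi
        vectors[word] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[feat] for feat, weight in u.items() if feat in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, already word-segmented (hypothetical data for illustration).
sentences = [
    ["我", "喜欢", "吃", "苹果"],  # I like to eat apples
    ["我", "喜欢", "吃", "香蕉"],  # I like to eat bananas
    ["他", "驾驶", "汽车"],        # He drives a car
]
vectors = pmi_vectors(window_contexts(sentences, window=2))
sim_fruit = cosine(vectors["苹果"], vectors["香蕉"])
sim_mixed = cosine(vectors["苹果"], vectors["汽车"])
```

In this toy corpus, 苹果 (apple) and 香蕉 (banana) occur in identical contexts and so score higher than 苹果 and 汽车 (car), which share no context words. A dependency-based variant would replace the window loop with (relation, head or dependent) features taken from a parser's output.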