HUANG Lan, DU Youfu. Learning the Semantic Relatedness of Chinese Words from Wikipedia[J]. Journal of Chinese Information Processing, 2016, 30(3): 36-45.
Learning the Semantic Relatedness of Chinese Words from Wikipedia
HUANG Lan, DU Youfu
College of Computer Science, Yangtze University, Jingzhou, Hubei 434000, China
Abstract: Semantic word relatedness measures are fundamental to many text analysis tasks such as information retrieval, classification and clustering. As the largest online encyclopedia today, Wikipedia has been successfully exploited as background knowledge to overcome lexical differences between words and derive accurate semantic word relatedness measures. The Chinese Wikipedia, however, covers only ten percent of its English counterpart. This sparseness in the concept space and its associated resources adversely affects word relatedness computation. To address the sparseness problem, we propose a method that utilizes different types of structured information automatically extracted from various resources in Wikipedia, such as articles' full text and their associated hyperlink structures. We use machine learning algorithms to learn the best combination of these resources from manually labeled training data. Experiments on three standard Chinese benchmark datasets showed that our method is 20%-40% more consistent with an average human labeler than state-of-the-art methods.
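The idea of learning a combination of relatedness signals from labeled word pairs can be illustrated with a minimal sketch. It assumes two hypothetical per-pair features (a full-text similarity score and a link-based relatedness score) and uses ordinary least squares as a stand-in learner; the abstract does not name the learning algorithm or the exact features, and the toy numbers below are invented purely for illustration.

```python
from typing import List, Sequence

# Toy data (invented for illustration): each row holds two hypothetical
# feature scores for one word pair: (full-text similarity, link-based score).
TOY_FEATURES = [(0.9, 0.8), (0.2, 0.1), (0.5, 0.7), (0.1, 0.4)]
# Hypothetical human relatedness labels for the same four word pairs.
TOY_LABELS = [0.88, 0.25, 0.61, 0.28]


def fit_linear_combination(features: Sequence[Sequence[float]],
                           labels: Sequence[float]) -> List[float]:
    """Return least-squares weights (bias first) for combining the features."""
    # Prepend a constant 1.0 so the first weight acts as a bias term.
    rows = [[1.0, *f] for f in features]
    n = len(rows[0])
    # Build the normal equations (X^T X) w = X^T y as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * y for r, y in zip(rows, labels))]
           for i in range(n)]
    # Solve with Gauss-Jordan elimination and partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(weights: Sequence[float], feature_row: Sequence[float]) -> float:
    """Combine one pair's feature scores into a single relatedness estimate."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


if __name__ == "__main__":
    weights = fit_linear_combination(TOY_FEATURES, TOY_LABELS)
    print("learned weights (bias, full-text, link):", weights)
    print("predicted relatedness:", predict(weights, (0.5, 0.7)))
```

A linear model is used here only because it is easy to inspect; any regression learner could take its place, and the learned weights indicate how much each resource contributes to the combined score.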