Abstract
Dictionaries are an important resource for Chinese natural language processing, supporting lexical analysis (such as word segmentation and POS tagging), syntactic parsing, and semantic analysis. This paper uses crowdsourcing to build a Chinese semantic relevance dictionary. Because the dictionary is obtained indirectly, by collecting the words that people associate with trigger words, it is also called a word association network. Compared with traditional dictionaries, the word association network has the following characteristics: (1) it is cheap to acquire; (2) it is Internet-oriented and easy to extend; (3) word relations are established from the perspective of human cognition and therefore agree with human intuition. The paper describes in detail how the word association network is collected and analyzes the data obtained so far. It also compares the network with HowNet, Tongyici Cilin, and word n-grams from Weibo text to illustrate these characteristics.
Key words: crowdsourcing; semantic relevance dictionary; word association network
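Funding
Key Program of the National Natural Science Foundation of China (61133012); Major Project of the National 863 Program (2011AA01A207); Advanced Technology Research Project of the National 863 Program (2012AA011102)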