Abstract
For social media text mining, traditional lexicon-based methods suffer from low accuracy and from the difficulty of obtaining suitable lexicons. This paper proposes a text mining method based on dependency parsing that extracts information from social media text by rule matching. The method does not rely on a lexicon, and experiments show a substantial improvement in accuracy over the lexicon-based method. Applying this method to Weibo text, we analyzed the characteristics of eating habits along the dimensions of gender, region, and time, with support for cross-analysis across these dimensions, and present the results as word clouds.
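To illustrate what rule matching over a dependency parse can look like, here is a minimal sketch. It is not the authors' implementation. The trigger verbs, the single rule (take the `VOB` object of an eating verb), the `Token` structure, and the hand-written parse are all assumptions made for illustration. In practice the parse would come from a dependency parser rather than being typed in by hand.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    idx: int    # 1-based position in the sentence
    word: str   # surface form
    head: int   # index of the head token, 0 for the root
    rel: str    # dependency label (LTP-style labels assumed: SBV, VOB, HED, ...)


# Hypothetical trigger verbs: "eat" and "drink".
EATING_VERBS = {"吃", "喝"}


def extract_objects(sentence):
    """Return words attached to an eating verb by a verb-object (VOB) arc."""
    by_idx = {t.idx: t for t in sentence}
    found = []
    for t in sentence:
        head = by_idx.get(t.head)
        if t.rel == "VOB" and head is not None and head.word in EATING_VERBS:
            found.append(t.word)
    return found


def count_by_group(posts):
    """Tally extracted food words per group label (e.g. gender or region)."""
    counts = {}
    for group, sentence in posts:
        counts.setdefault(group, Counter()).update(extract_objects(sentence))
    return counts


# Hand-written parse of "我 今天 吃 了 火锅" ("I ate hotpot today").
SENTENCE = [
    Token(1, "我", 3, "SBV"),
    Token(2, "今天", 3, "ADV"),
    Token(3, "吃", 0, "HED"),
    Token(4, "了", 3, "RAD"),
    Token(5, "火锅", 3, "VOB"),
]

# Hand-written parse of "我 看 电影" ("I watch a movie"); no eating verb.
OTHER = [
    Token(1, "我", 2, "SBV"),
    Token(2, "看", 0, "HED"),
    Token(3, "电影", 2, "VOB"),
]

if __name__ == "__main__":
    print(extract_objects(SENTENCE))
    print(count_by_group([("female", SENTENCE), ("male", OTHER)]))
```

Because the rule keys on the syntactic relation rather than on a list of food names, no food lexicon is required. Per-group tallies like those from `count_by_group` are the kind of frequency data a word cloud can be drawn from.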
Key words
dependency parsing /
text mining /
social media /
eating habits analysis
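Funding
National Basic Research Program of China (973 Program) (2014CB340503); National Natural Science Foundation of China, General Program (61370164); National Natural Science Foundation of China, Key Program (61133012)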