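This paper studies gender classification of Chinese microblog users, that is, identifying the gender of registered users from the Chinese text available on the microblog. Although gender classification on microblogs has been studied to some extent, work targeting Chinese remains scarce. The paper first proposes building two classifiers, one from user names and one from microblog text, to determine a user's gender, and analyzes different features (e.g., character features and word features). Second, on the basis of these two classifiers, a Bayesian fusion method is used to combine them, so that both kinds of textual information are used together to judge a user's gender. Experimental results show that the method achieves high recognition accuracy and that classifier fusion is clearly better than classification using only user names or only microblog text.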
Abstract
This paper investigates gender classification of Chinese microblog users, that is, identifying whether a registered user is male or female from the Chinese text available on the microblog. Although microblog-based gender classification has received some attention, little work addresses Chinese. First, two classifiers are proposed, one trained on user names and one on the messages sent by the users, and different types of features (e.g., character and word features) are analyzed. Second, on the basis of these two classifiers, Bayes rule is employed to combine them so that the prediction draws on knowledge from both user names and messages. Experimental results demonstrate that the proposed approach achieves high classification accuracy and that the combination method clearly outperforms the individual classifiers trained on only user names or only messages.
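The abstract does not give the exact fusion formula, so the following is only a minimal sketch of one common way to combine two classifiers with Bayes rule: the product rule, which assumes the user name and the message text are conditionally independent given the gender. The function names `char_ngrams` and `combine_posteriors`, the class labels, and all numbers are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
from typing import Dict, List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return the character n-grams of a string (a simple character feature)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combine_posteriors(p_name: Dict[str, float],
                       p_text: Dict[str, float],
                       prior: Dict[str, float]) -> Dict[str, float]:
    """Fuse two classifiers' posteriors with the product rule.

    Assumption: the two views are conditionally independent given the class,
    so P(c | name, text) is proportional to P(c | name) * P(c | text) / P(c).
    """
    scores = {c: p_name[c] * p_text[c] / prior[c] for c in prior}
    total = sum(scores.values())
    # Normalize so the fused scores form a probability distribution.
    return {c: s / total for c, s in scores.items()}


# Character bigrams of a short Chinese string.
bigrams = char_ngrams("微博用户", 2)

# Hypothetical posteriors for one user from the two individual classifiers.
p_name = {"male": 0.7, "female": 0.3}   # user-name classifier (illustrative)
p_text = {"male": 0.6, "female": 0.4}   # message-text classifier (illustrative)
prior = {"male": 0.5, "female": 0.5}    # assumed uniform class prior

fused = combine_posteriors(p_name, p_text, prior)
prediction = max(fused, key=fused.get)
```

When both classifiers lean the same way, the fused posterior is more confident than either one alone; when they disagree, the more confident classifier dominates. Other combination rules (sum, max, voting) would behave differently, and the sketch makes no claim about which one the paper uses.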
Keywords
gender classification /
Sina Weibo /
text classification /
social network
Key words
gender classification /
Sina Weibo /
text classification /
social media
Funding
National Natural Science Foundation of China (61375073)