Abstract
Most Chinese and English micro-blogs are written in a single language, whereas nearly 80% of Tibetan micro-blogs are mixed Tibetan-Chinese text. Analyzing sentiment orientation on only the Tibetan content or only the Chinese content loses sentiment information and does not give good results. Based on the expression characteristics of Tibetan micro-blogs, this paper proposes a multi-feature sentiment orientation analysis algorithm that uses emotional words, part-of-speech sequences, sentence-pattern information, and emoticons as features. Because Chinese expressions frequently appear in Tibetan micro-blogs, the algorithm also takes Chinese sentiment information as features, and the bilingual sentiment features improve the analysis results. Experiments show that the method reaches an accuracy of 79.8% on micro-blogs written purely in Tibetan, and 82.8% on Tibetan-Chinese bilingual micro-blogs after Chinese emotional words, Chinese punctuation, and other Chinese features are added.
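The idea of combining Tibetan and Chinese sentiment cues can be illustrated with a minimal sketch. It is not the paper's implementation: the lexicons, the emoticon table, the feature names, and the rule-based `polarity` function are all hypothetical placeholders, and part-of-speech sequences and sentence patterns are left out because they would require a Tibetan tagger. The sketch only shows how a mixed post might be split by Unicode script range so that each language contributes its own features.

```python
import re

# Unicode ranges used to split a mixed post by script.
TIBETAN_RUN = re.compile(r"[\u0F00-\u0FFF]+")
CHINESE_RUN = re.compile(r"[\u4E00-\u9FFF]+")

# Toy resources (assumptions for illustration, not the paper's lexicons).
BO_POSITIVE = {"\u0f51\u0f42\u0f60"}          # illustrative Tibetan entry
BO_NEGATIVE = {"\u0f66\u0fa1\u0f74\u0f42"}    # illustrative Tibetan entry
ZH_POSITIVE = {"开心", "喜欢"}
ZH_NEGATIVE = {"难过", "讨厌"}
EMOTICONS = {"[哈哈]": 1, "[泪]": -1}


def count_hits(text, lexicon):
    """Count occurrences of every lexicon entry in the text."""
    return sum(text.count(word) for word in lexicon)


def extract_features(post):
    """Build per-language sentiment counts plus emoticon and punctuation cues."""
    tibetan = " ".join(TIBETAN_RUN.findall(post))
    chinese = " ".join(CHINESE_RUN.findall(post))
    return {
        "bo_pos": count_hits(tibetan, BO_POSITIVE),
        "bo_neg": count_hits(tibetan, BO_NEGATIVE),
        "zh_pos": count_hits(chinese, ZH_POSITIVE),
        "zh_neg": count_hits(chinese, ZH_NEGATIVE),
        "emoticon": sum(v * post.count(k) for k, v in EMOTICONS.items()),
        "exclaim": post.count("!") + post.count("\uff01"),
    }


def polarity(features, use_chinese=True):
    """Toy scoring rule standing in for whatever classifier is actually used."""
    score = features["bo_pos"] - features["bo_neg"] + features["emoticon"]
    if use_chinese:
        score += features["zh_pos"] - features["zh_neg"]
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# A mixed post: one positive Tibetan entry followed by negative Chinese text.
post = "\u0f51\u0f42\u0f60\u0f0b 今天很难过,真的讨厌"
features = extract_features(post)
tibetan_only = polarity(features, use_chinese=False)
bilingual = polarity(features, use_chinese=True)
```

On this toy post the Tibetan-only score and the bilingual score disagree, which is the kind of information loss the abstract describes when one language is ignored.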
Key words
Tibetan micro-blog /
mixed text /
sentiment orientation /
emotional words /
part of speech sequence
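Funding
National Natural Science Foundation of China (61262054); Northwest Minzu University Central Special Fund Graduate Program (Yxm2014001); National Science and Technology Support Program (2014BAK10B03); Gansu Province Major Science and Technology Special Project (1203FKDA033)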