Abstract
Rough set theory is a mathematical tool for handling imprecise, incomplete, and uncertain information, and rough-set attribute reduction can reduce the set of text sentiment word features while keeping sentiment classification ability unchanged. To address the excessively high dimensionality of the sentiment word feature space and the lack of semantic information in sentiment word feature representations, this paper proposes RS-WvGv, a feature representation method for sentiment words in Chinese text. A rough-set decision table is used to model the sentiment word features of the whole corpus, and the Johnson attribute reduction algorithm is applied to simplify the decision table, retaining a minimal set of sentiment word feature attributes. All sentiment feature words in this set are then represented with word embeddings, and a logistic regression classifier is used to verify the effectiveness of RS-WvGv. The paper also defines the coverage of a sentiment word feature attribute set, which expresses how well that attribute set covers the corpus. Finally, statistical tests in the comparative experiments further confirm the effectiveness of the method.
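The reduction step named in the abstract can be illustrated with a minimal sketch. It assumes that each document is a row of a decision table, that the condition attributes are binary presence flags for sentiment words, and that the decision attribute is the sentiment label. The function name `johnson_reduct`, the toy words, and the toy rows are hypothetical and are not taken from the paper; the code shows the generic greedy Johnson-style heuristic, not necessarily the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def johnson_reduct(rows, decisions):
    """Greedy (Johnson-style) attribute reduction on a decision table.

    rows: list of equal-length tuples of condition-attribute values
    decisions: list of decision labels, one per row
    Returns a sorted list of selected attribute indices.
    """
    # Discernibility sets: for every pair of rows with different decisions,
    # collect the attributes on which the two rows differ.
    clauses = []
    for (r1, d1), (r2, d2) in combinations(zip(rows, decisions), 2):
        if d1 != d2:
            diff = frozenset(i for i, (a, b) in enumerate(zip(r1, r2)) if a != b)
            if diff:  # skip inconsistent pairs that no attribute can separate
                clauses.append(diff)

    selected = []
    while clauses:
        counts = Counter(a for clause in clauses for a in clause)
        # Pick the most frequent attribute; ties go to the lowest index.
        best = min(counts, key=lambda a: (-counts[a], a))
        selected.append(best)
        # Drop every discernibility set already covered by the chosen attribute.
        clauses = [c for c in clauses if best not in c]
    return sorted(selected)


# Hypothetical toy corpus: columns are presence flags for four sentiment words.
words = ["good", "bad", "great", "awful"]
rows = [
    (1, 0, 1, 0),
    (1, 0, 0, 0),
    (0, 1, 0, 1),
    (0, 1, 0, 0),
]
labels = ["pos", "pos", "neg", "neg"]

kept = johnson_reduct(rows, labels)
print([words[i] for i in kept])
```

In this toy table a single word already separates the two classes, so the greedy pass stops after one selection. The greedy choice gives a small attribute set but does not guarantee the globally smallest one; in a pipeline like the one the abstract describes, the retained words would then be mapped to word embeddings before classification.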
Key words
attribute reduction / sentiment feature extraction / word vector / sentiment classification
Funding
Shanxi Province Applied Basic Research Program (201801D221190, 201801D121144)