News Comments Clustering Based on WMD Distance and Affinity Propagation

GUAN Saiping, JIN Xiaolong, XU Xueke, WU Dayong, JIA Yantao, WANG Yuanzhuo, LIU Yue

Journal of Chinese Information Processing, 2017, Vol. 31, Issue 5: 203-214.
Sentiment Analysis and Social Computing

Abstract

With the rapid development of news websites, online news and comment data have surged, bringing people a large amount of valuable information. News keeps people informed of current events at home and abroad, while comments reflect people's opinions on those events, which matters for applications such as public opinion analysis and news comment recommendation. However, news comments are numerous, noisy, and usually short, so it is difficult to see quickly and intuitively what commenters focus on. To address this, this paper proposes EWMD-AP, a clustering method for news comments that automatically mines the public's focuses on an event. The method uses Word Mover's Distance (WMD) with enhanced weight vectors to compute the distances between comments, and then applies Affinity Propagation (AP) to cluster the comments, obtaining the focus clusters and their representative comments from the unordered comments. In particular, the paper replaces the word-frequency weight vectors of traditional WMD with enhanced weight vectors. The enhanced weight has three components: a word importance coefficient that combines part-of-speech features and textual expression features, a de-contextualization coefficient that treats the news body as the background of the comments, and a TFIDF coefficient. Comparative experiments on 24 news comment datasets show that EWMD-AP clusters news comments better than both traditional clustering algorithms (e.g., Kmeans and Mean Shift) and recent state-of-the-art ones (e.g., Density Peaks).
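The pipeline described in the abstract (weighted word distances between comments, followed by Affinity Propagation) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the three coefficients are simply multiplied, the de-contextualization coefficient is a hypothetical `1 / (1 + count in news body)`, exact WMD is swapped for the cheaper relaxed WMD lower bound, the preference is set to the median similarity, and all function names, parameters, and toy word vectors are invented for illustration.

```python
import math
from collections import Counter
from statistics import median


def enhanced_weights(tokens, importance, news_counts, idf):
    """Normalized per-word weights for one comment.

    Assumption: the three coefficients are combined by multiplication.
    """
    weights = {}
    for word, count in Counter(tokens).items():
        tfidf = (count / len(tokens)) * idf.get(word, 1.0)
        # Hypothetical de-contextualization: words frequent in the news body count less.
        decontext = 1.0 / (1.0 + news_counts.get(word, 0))
        weights[word] = importance.get(word, 1.0) * decontext * tfidf
    total = sum(weights.values())
    return {word: w / total for word, w in weights.items()}


def relaxed_wmd(wa, wb, vec):
    """Relaxed WMD (a lower bound on exact WMD, used here instead of it):
    each word moves all of its mass to the nearest word of the other comment."""
    def one_sided(src, dst):
        return sum(m * min(math.dist(vec[w], vec[x]) for x in dst)
                   for w, m in src.items())
    return max(one_sided(wa, wb), one_sided(wb, wa))


def affinity_propagation(S, damping=0.5, iters=200):
    """Plain Affinity Propagation on a similarity matrix S (needs at least 2 points).
    Returns, for every point, the index of its exemplar."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iters):
        for i in range(n):
            for k in range(n):
                best = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best)
        for k in range(n):
            pos = [max(0.0, R[j][k]) for j in range(n)]
            for i in range(n):
                if i == k:
                    new = sum(pos) - pos[k]
                else:
                    new = min(0.0, R[k][k] + sum(pos) - pos[i] - pos[k])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    score = [R[k][k] + A[k][k] for k in range(n)]
    # Fall back to the single best-scoring point if no exemplar emerged.
    exemplars = [k for k in range(n) if score[k] > 0] or [score.index(max(score))]
    return [i if i in exemplars else max(exemplars, key=lambda e: S[i][e])
            for i in range(n)]


def cluster_comments(weight_vectors, vec):
    """Similarity = negative distance; preference = median off-diagonal similarity."""
    n = len(weight_vectors)
    S = [[-relaxed_wmd(weight_vectors[i], weight_vectors[k], vec)
          for k in range(n)] for i in range(n)]
    pref = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        S[k][k] = pref
    return affinity_propagation(S)


if __name__ == "__main__":
    # Toy 2-D word vectors and comments, invented for illustration only.
    vec = {"goal": (1.0, 0.0), "match": (0.9, 0.1), "team": (0.8, 0.2),
           "tax": (0.0, 1.0), "price": (0.1, 0.9), "cost": (0.2, 0.8)}
    comments = [["goal", "match"], ["match", "team"], ["goal", "team"],
                ["tax", "price"], ["price", "cost"], ["tax", "cost"]]
    weights = [enhanced_weights(c, {}, {"match": 2}, {}) for c in comments]
    for comment, exemplar in zip(comments, cluster_comments(weights, vec)):
        print(comment, "-> exemplar comment", exemplar)
```

Each returned label is the index of a comment that serves as the cluster's exemplar, which corresponds to the "representative comment" of a focus cluster in the abstract. Replacing `relaxed_wmd` with an exact optimal-transport solver would bring the sketch closer to WMD proper, at a higher computational cost.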

Key words

news comments clustering / enhanced weight vectors / de-contextualization / Word Mover's Distance / affinity propagation

Cite this article

GUAN Saiping, JIN Xiaolong, XU Xueke, WU Dayong, JIA Yantao, WANG Yuanzhuo, LIU Yue. News Comments Clustering Based on WMD Distance and Affinity Propagation. Journal of Chinese Information Processing. 2017, 31(5): 203-214


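Funding

National Key Research and Development Program of China (2016YFB1000902); National Basic Research Program of China (973 Program) (2014CB340406); National Natural Science Foundation of China (61772501, 61572473, 61572469, 61402442, 91646120)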