Event Clustering Method for Chinese Social Text Based on Semi-supervised Learning

GUO Hengrui, WANG Zhongqing, ZHU Qiaoming, LI Peifeng

Journal of Chinese Information Processing, 2022, Vol. 36, Issue 2: 152-159.
Sentiment Analysis and Social Computing


Abstract

Event clustering on social media aims to group short texts by the events they describe. Existing event clustering models are either unsupervised or supervised: unsupervised models cluster poorly, while supervised models depend on large amounts of labeled data. To address this, the paper proposes a semi-supervised incremental event clustering model, SemiEC. Starting from a small-scale annotated dataset, the model encodes events with an LSTM, computes text similarity with a linear model, and performs incremental clustering. It then retrains the model on the labeled data produced by incremental clustering and, once retraining is complete, re-clusters the uncertain samples. Experimental results show that SemiEC substantially outperforms the baseline models.
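The incremental clustering step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain cosine similarity stands in for the learned linear similarity model, the input vectors stand in for LSTM event encodings, and the retraining step is omitted. The names `incremental_cluster`, `join`, and `margin`, and the rule that a sample is "uncertain" when its best similarity falls close to the joining threshold, are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity (stand-in for the paper's learned similarity model)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def incremental_cluster(vectors, similarity=cosine, join=0.8, margin=0.1):
    """Assign each vector, in arrival order, to its most similar cluster,
    or open a new cluster when no similarity reaches `join`.

    Returns (labels, uncertain). `uncertain` holds the indices whose best
    similarity lies within `margin` of `join` (hypothetical criterion).
    """
    clusters = []   # clusters[c] is the list of member vectors of cluster c
    labels = []
    uncertain = []
    for i, vec in enumerate(vectors):
        best_id, best_sim = None, -1.0
        for cid, members in enumerate(clusters):
            sim = similarity(vec, centroid(members))
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is not None and best_sim >= join:
            clusters[best_id].append(vec)
            labels.append(best_id)
        else:
            clusters.append([vec])
            labels.append(len(clusters) - 1)
        # Decisions made close to the threshold are flagged for later review.
        if abs(best_sim - join) < margin:
            uncertain.append(i)
    return labels, uncertain


# Toy 2-D "event encodings"; real inputs would be LSTM representations.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.5]]
labels, uncertain = incremental_cluster(toy)
print(labels, uncertain)
```

In a semi-supervised loop of the kind the abstract describes, the confidently assigned samples would serve as additional training labels, and the flagged indices would be reassigned after retraining.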

Key words

event clustering on social text / incremental clustering / text similarity

Cite this article

GUO Hengrui, WANG Zhongqing, ZHU Qiaoming, LI Peifeng. Event Clustering Method for Chinese Social Text Based on Semi-supervised Learning. Journal of Chinese Information Processing. 2022, 36(2): 152-159

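Funding

National Natural Science Foundation of China (61772354, 61836007); National Natural Science Foundation of China, Young Scientists Fund (61806137); Priority Academic Program Development of Jiangsu Higher Education Institutions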