Abstract
Tree-structured neural topic models based on the variational auto-encoder can effectively mine the hierarchical semantic features of text, but existing tree-structured neural topic models rely only on statistical features such as word frequency and ignore the external prior knowledge that could help topic discovery. For the task of course topic analysis, this paper incorporates the idea of transfer learning and proposes a tree-structured neural topic model based on BERT embeddings and knowledge distillation. First, a BERT-CRF word segmentation model is constructed, and a small amount of domain text is used to further train BERT and optimize the representations of domain characters; the further-trained BERT character embeddings are then dynamically fused into coarse-grained domain word embeddings, alleviating the mismatch between character-level BERT embeddings and the bag-of-words representation. Second, to address the sparsity of the bag-of-words representation, a BERT autoencoder is built with document reconstruction as its objective, and its supervised document representations are distilled to guide the topic model's document reconstruction and improve topic quality. Finally, the tree-structured neural topic model is optimized to fit the auxiliary-information-rich BERT word embeddings, and the supervised distilled knowledge guides the document reconstruction of the unsupervised topic model. Experiments show that the proposed model combines the strengths of pretrained language models and topic models and summarizes course topics more effectively.
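To make the first two steps concrete, the following PyTorch sketch is a minimal illustration rather than the authors' implementation: pool_char_embeddings mean-pools character-level BERT embeddings over the word spans produced by the BERT-CRF segmenter to obtain coarse-grained word embeddings (the paper's dynamic fusion may weight characters differently), and distillation_reconstruction_loss mixes the topic model's bag-of-words reconstruction loss with a temperature-scaled KL distillation term computed against the word distribution reconstructed by the BERT autoencoder. The function names, the mean-pooling choice, and the alpha/temperature hyperparameters are illustrative assumptions.

import torch
import torch.nn.functional as F
from typing import List, Tuple


def pool_char_embeddings(char_embeddings: torch.Tensor,
                         word_spans: List[Tuple[int, int]]) -> torch.Tensor:
    # char_embeddings: (seq_len, hidden) character-level outputs of the
    # domain-adapted BERT; word_spans: half-open (start, end) index ranges
    # produced by the BERT-CRF segmenter, one per word.
    # Returns (num_words, hidden) coarse-grained word embeddings obtained by
    # averaging the character embeddings inside each word span.
    return torch.stack([char_embeddings[s:e].mean(dim=0) for s, e in word_spans])


def distillation_reconstruction_loss(student_word_probs: torch.Tensor,
                                     teacher_word_probs: torch.Tensor,
                                     bow: torch.Tensor,
                                     alpha: float = 0.5,
                                     temperature: float = 2.0) -> torch.Tensor:
    # student_word_probs: (batch, vocab) word distribution reconstructed by the
    #                     tree-structured neural topic model (the student).
    # teacher_word_probs: (batch, vocab) soft word distribution from the BERT
    #                     autoencoder trained on document reconstruction (teacher).
    # bow:                (batch, vocab) bag-of-words counts of the documents.
    eps = 1e-10

    # Standard unsupervised reconstruction: negative log-likelihood of the BoW.
    recon = -(bow * torch.log(student_word_probs + eps)).sum(dim=1).mean()

    # Distillation: KL divergence between temperature-softened teacher and
    # student distributions (log-probabilities serve as surrogate logits).
    student_log = torch.log_softmax(torch.log(student_word_probs + eps) / temperature, dim=1)
    teacher_soft = torch.softmax(torch.log(teacher_word_probs + eps) / temperature, dim=1)
    distill = F.kl_div(student_log, teacher_soft, reduction="batchmean") * temperature ** 2

    # Convex combination of the unsupervised and distillation objectives.
    return (1.0 - alpha) * recon + alpha * distill

In this reading, recon plays the role of the usual variational reconstruction term, while the weight alpha controls how strongly the supervised teacher guides the unsupervised student.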
Key words
tree-structured neural topic model /
BERT /
knowledge distillation /
variational auto-encoder
Funding
National Natural Science Foundation of China (61806103, 61562068); Natural Science Foundation of Inner Mongolia (2017MS0607, 2021LHMS06010); National 242 Information Security Program (2019A114)