Abstract
Uyghur is morphologically rich and resource-poor, so existing deep learning models applied directly do not perform well on Uyghur text classification. To address this, this paper proposes MDPLC, a text classification model that combines Bi-LSTM+CNN with DPCNN. First, pre-trained word vectors are fused with the semantic information produced by a Bi-LSTM to capture semantic dependencies across the whole sentence, and a CNN with combined pooling then further strengthens local semantic learning. In parallel, a multi-convolution kernel DPCNN captures text semantic information in a dual-channel manner. Finally, the information extracted by the two models is fused to complete the classification task. To verify the model's effectiveness, experiments are conducted on short and long text datasets in Chinese, English, and Uyghur. The results show that the model outperforms existing mainstream deep learning models on multiple classification tasks, which supports its robustness across languages and across both semantically sparse and semantically rich text.
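The local-feature step and the final fusion described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not define "combined pooling" or the fusion operator, so the sketch assumes that combined pooling concatenates max pooling and average pooling, and that fusion is plain vector concatenation. The Bi-LSTM and DPCNN themselves are not implemented here; `dpcnn_branch` is a hypothetical placeholder vector, and all function names are illustrative.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]  # one embedding vector per token


def conv1d_relu(seq: Sequence, kernel: Sequence, bias: float = 0.0) -> Vector:
    """Slide a (k x d) kernel over the token axis and apply ReLU."""
    k = len(kernel)
    if len(seq) < k:
        raise ValueError("sequence is shorter than the kernel")
    feature_map = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        total = bias + sum(
            w * x
            for w_row, x_row in zip(kernel, window)
            for w, x in zip(w_row, x_row)
        )
        feature_map.append(max(0.0, total))
    return feature_map


def combined_pool(feature_map: Vector) -> Vector:
    """Assumed form of combined pooling: [max pooling, average pooling]."""
    return [max(feature_map), sum(feature_map) / len(feature_map)]


def multi_kernel_features(seq: Sequence, kernels: List[Sequence]) -> Vector:
    """Apply several kernel sizes and concatenate their pooled outputs."""
    features: Vector = []
    for kernel in kernels:
        features.extend(combined_pool(conv1d_relu(seq, kernel)))
    return features


def fuse(branch_a: Vector, branch_b: Vector) -> Vector:
    """Assumed fusion of the two branches: concatenation."""
    return branch_a + branch_b


# Toy input: 4 tokens with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],              # window of 2 tokens
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # window of 3 tokens
]

features = multi_kernel_features(tokens, kernels)
dpcnn_branch = [0.5, 0.25]  # hypothetical stand-in for the DPCNN output
fused = fuse(features, dpcnn_branch)
print(features)
print(fused)
```

Concatenating max and average pooling keeps both the strongest local activation and the overall activation level for each kernel size, which is one plausible reason to combine pooling operators when texts range from short and sparse to long and rich.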
Key words
Uyghur /
text classification /
multi-convolution kernel DPCNN /
Bi-LSTM+CNN
Funding
Sub-project of the National Key Research and Development Program of China (2017YFB1002103); National Natural Science Foundation of China (61762084)