Abstract
Text classification is widely used in document retrieval, web search, and other fields, where the vector representation of text strongly affects classification performance. When converting variable-length texts into fixed-length vectors, the classical paragraph vectorization algorithm Doc2Vec ignores the fact that the number of training updates a paragraph receives is highly correlated with its length, as well as the fact that long paragraphs generally contain the information of short ones; this limits further improvement of classification accuracy. To address this problem, this paper proposes a positive excitation method on paragraph vectors for text classification. Specifically, long and short paragraph vectors are first distinguished by the median paragraph length, and the weight of long paragraph vectors is then increased at the input of the classification model. Experiments on the Stanford Sentiment Treebank, IMDB, and Amazon Reviews datasets show that, with a properly chosen incentive coefficient, the classification model with paragraph-vector positive excitation achieves higher accuracy.
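The core step described above, splitting paragraphs at the median length and boosting the vectors of long paragraphs before they enter the classifier, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `positive_excitation` and the coefficient value are hypothetical, and the input is assumed to be an array of paragraph vectors (e.g. from a trained Doc2Vec model) with the word count of each paragraph.

```python
import numpy as np

def positive_excitation(vectors, lengths, coef=1.5):
    """Scale the vectors of long paragraphs (word count above the median)
    by an incentive coefficient before feeding them to a classifier.

    vectors: (n, d) array of paragraph vectors
    lengths: length-n sequence of paragraph word counts
    coef:    incentive coefficient (> 1 boosts long paragraphs)
    """
    vectors = np.asarray(vectors, dtype=float)
    lengths = np.asarray(lengths)
    median_len = np.median(lengths)          # split point between short and long
    boosted = vectors.copy()
    boosted[lengths > median_len] *= coef    # up-weight long-paragraph vectors
    return boosted

# toy example: 4 paragraph vectors, word counts 5, 50, 12, 80 (median = 31)
out = positive_excitation(np.ones((4, 3)), [5, 50, 12, 80], coef=1.5)
```

In this toy run the paragraphs with 50 and 80 words exceed the median of 31, so their vectors are scaled by 1.5 while the other two are unchanged; the boosted matrix would then be passed to the classifier in place of the raw Doc2Vec vectors.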
Keywords
positive excitation; paragraph vector; text classification
Funding
Key Research and Development Program of the Ministry of Science and Technology (2018YFB2100400); National Natural Science Foundation of China (61902082)