Google Search Result Classification Based on Pre-training

ZHANG Enwei 1,2,3, HU Kai 1,3, ZHUO Junjie 2, CHEN Zhili 2

PDF (3006 KB)
Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (3): 102-112.
Information Extraction and Text Mining


Abstract

A preliminary judgment on the results returned by a search engine helps optimize the semantic search process and improve search accuracy and efficiency. Google dominates among search engines, yet the results it returns are often highly complex, and there is currently no effective method for accurately judging the results on a search page. To address this, starting from the data characteristics and the model structure design, this paper first constructs a dataset suited to Google search result classification, and then proposes a dual-channel model (DCFE) built on a pre-trained model to classify Google search results. The proposed model reaches an accuracy of 85.74% on the self-built dataset, higher than that of existing models.
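The abstract does not spell out DCFE's internals, but the general dual-channel idea it names can be sketched: two independent feature extractors process the same input, their outputs are fused by concatenation, and a classifier head scores the classes. The sketch below is illustrative only, with random untrained weights standing in for pre-trained encoder channels; all names and dimensions are assumptions, not the paper's architecture.

```python
import numpy as np

class DualChannelClassifier:
    """Toy dual-channel text classifier: each channel extracts its own
    feature vector from the same input, the two are concatenated, and a
    linear layer plus softmax produces class probabilities."""

    def __init__(self, in_dim, hid_dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.Wa = rng.standard_normal((in_dim, hid_dim)) * 0.1   # channel A weights
        self.Wb = rng.standard_normal((in_dim, hid_dim)) * 0.1   # channel B weights
        self.Wo = rng.standard_normal((2 * hid_dim, n_classes)) * 0.1  # classifier head

    def forward(self, x):
        fa = np.tanh(x @ self.Wa)                  # channel A features
        fb = np.maximum(0.0, x @ self.Wb)          # channel B features (ReLU)
        fused = np.concatenate([fa, fb], axis=-1)  # dual-channel fusion
        logits = fused @ self.Wo
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
        return e / e.sum(axis=-1, keepdims=True)

# A batch of 4 "embedded snippets" (stand-ins for pre-trained encoder outputs).
model = DualChannelClassifier(in_dim=32, hid_dim=16, n_classes=3)
probs = model.forward(np.random.default_rng(1).standard_normal((4, 32)))
assert probs.shape == (4, 3) and np.allclose(probs.sum(axis=1), 1.0)
```

In a real system each channel would be a learned network (for example, a pre-trained Transformer encoder and a lighter lexical channel), and the whole stack would be trained end to end on the labeled search-result dataset.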


Keywords

Google search / pre-training / deep learning

Cite This Article
ZHANG Enwei, HU Kai, ZHUO Junjie, CHEN Zhili. Google Search Result Classification Based on Pre-training. Journal of Chinese Information Processing. 2024, 38(3): 102-112


Funding

2023 Jiangsu Provincial Postgraduate Research and Practice Innovation Program (SJCX23_0394)