Journal of Chinese Information Processing, 2025, Vol. 39, Issue 2: 153-161
Natural Language Processing Applications

Automatic Mapping between IPC and CLC Categories Based on Pre-trained Language Models

  • HUANG Min1, WEI Jiaqin1, LI Maoxi1,2

Abstract

Patents and books/journals are major sources of scientific and technological innovation for industry and academia. Patents are typically classified by the International Patent Classification (IPC), while Chinese books and journals are classified by the Chinese Library Classification (CLC). These differing classification schemes hinder the integration, sharing, and cross-database retrieval and browsing of patent and book/journal information. To address the difficulty of accurately mapping IPC categories to highly similar CLC categories, this paper proposes two automatic IPC-CLC category mapping methods: for scenarios with limited computational resources, a method combining the pre-trained language model BERT with the text entailment model ESIM; for scenarios with sufficient computational resources, a method based on the large language model ChatGLM2-6B. Experimental results on a public IPC-CLC category mapping dataset, and on a dataset of IPC categories and highly similar CLC categories constructed from it, show that both proposed methods achieve statistically significant improvements over the baselines, including Sia-BERT and other state-of-the-art deep-neural-network methods for automatic category mapping of scientific literature. Ablation studies and a detailed analysis of mapping examples further confirm the effectiveness of the proposed methods.
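As a concrete illustration of the first method, the sketch below shows one plausible way to frame an IPC category description and a CLC category description as a text-entailment pair: BERT supplies contextual token embeddings, and an ESIM-style layer (BiLSTM encoding, soft cross-attention alignment, composition, pooling) scores the pair as mapped or not mapped. This is a minimal reconstruction from the abstract alone; the checkpoint `bert-base-chinese`, the hidden size, and the exact way BERT and ESIM are combined are assumptions, not the authors' released configuration.

```python
# Sketch of a BERT + ESIM category-pair matcher (our assumption of the
# architecture; not the authors' code). BERT replaces ESIM's word
# embeddings; the ESIM inference layers score the IPC/CLC pair.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel


class BertEsimMapper(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", hidden=300):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        dim = self.bert.config.hidden_size
        # ESIM input encoding: BiLSTM over BERT's contextual token embeddings.
        self.enc = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        # ESIM composition over the enhanced features [a; a~; a-a~; a*a~].
        self.comp = nn.LSTM(8 * hidden, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Sequential(nn.Linear(8 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))  # not mapped / mapped

    def _align(self, a, b, mask_b):
        # Soft cross-attention: each token of a attends over the tokens of b.
        scores = a @ b.transpose(1, 2)
        scores = scores.masked_fill(~mask_b.unsqueeze(1).bool(), -1e9)
        return torch.softmax(scores, dim=-1) @ b

    def forward(self, ipc, clc):
        a, _ = self.enc(self.bert(**ipc).last_hidden_state)
        b, _ = self.enc(self.bert(**clc).last_hidden_state)
        a_t = self._align(a, b, clc["attention_mask"])
        b_t = self._align(b, a, ipc["attention_mask"])
        va, _ = self.comp(torch.cat([a, a_t, a - a_t, a * a_t], -1))
        vb, _ = self.comp(torch.cat([b, b_t, b - b_t, b * b_t], -1))
        v = torch.cat([va.mean(1), va.max(1).values,
                       vb.mean(1), vb.max(1).values], -1)  # mean+max pooling
        return self.cls(v)


tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = BertEsimMapper()
# Real category descriptions: IPC G06 and CLC TP, a plausible mapped pair.
ipc = tok(["计算;推算;计数"], return_tensors="pt", padding=True)
clc = tok(["自动化技术、计算机技术"], return_tensors="pt", padding=True)
print(model(ipc, clc).softmax(-1))  # P(not mapped), P(mapped)
```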

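The second method can likewise be sketched, here as a zero-shot yes/no prompt to ChatGLM2-6B. The prompt wording and the zero-shot setup are illustrative assumptions (the paper's method may instead fine-tune the model), and the example categories are taken from the public IPC and CLC schedules.

```python
# Sketch: IPC-CLC mapping cast as a yes/no question to ChatGLM2-6B.
# The prompt template is our assumption, not the authors' exact setup.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b",
                                  trust_remote_code=True).half().cuda()  # needs a GPU
model = model.eval()

# "Do the following IPC and CLC categories describe the same topic?
#  Answer only yes (是) or no (否)." with IPC G06F and CLC TP3 as the pair.
prompt = ("判断下面的国际专利分类(IPC)类目与中国图书馆分类法(CLC)类目"
          "是否描述同一主题,只回答“是”或“否”。\n"
          "IPC类目: G06F 电数字数据处理\n"
          "CLC类目: TP3 计算技术、计算机技术")
response, history = model.chat(tokenizer, prompt, history=[])
print(response)  # expected: 是 (yes) or 否 (no)
```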
Key words

international patent classification / Chinese library classification / pre-trained language models / large language models / category mapping

Cite this article

HUANG Min, WEI Jiaqin, LI Maoxi. Automatic Mapping between IPC and CLC Categories Based on Pre-trained Language Models. Journal of Chinese Information Processing, 2025, 39(2): 153-161.


HUANG Min (1998–), M.S., research interests: natural language processing and machine translation. E-mail: hm@jxnu.edu.cn
WEI Jiaqin (2001–), M.S., research interests: natural language processing and machine translation. E-mail: weijiaqin0709@jxnu.edu.cn
LI Maoxi (1977–), corresponding author, Ph.D., professor, research interests: natural language processing and machine translation. E-mail: mosesli@jxnu.edu.cn

Funding

National Natural Science Foundation of China (62366020); Science and Technology Project of the Education Department of Jiangxi Province (GJJ210306)