Pre-trained Language Models for Product Attribute Extraction

ZHANG Shiqi, MA Jin, ZHOU Xiabing, JIA Hao, CHEN Wenliang, ZHANG Min

Journal of Chinese Information Processing ›› 2022, Vol. 36 ›› Issue (1): 56-64.
Information Extraction and Text Mining


Abstract

Attribute extraction is a key step in constructing a knowledge graph; its goal is to extract entity-related attribute values from unstructured text. This paper casts attribute extraction as a sequence labeling problem and uses distant supervision to automatically label e-commerce texts from multiple sources, alleviating the shortage of labeled data for product attribute extraction. To evaluate system performance accurately, a manually annotated test set is also constructed, yielding a multi-domain labeled dataset for product attribute extraction in e-commerce. On the newly constructed dataset, we run a series of experiments and analyze the results; in particular, we perform in-domain and cross-domain attribute extraction with a variety of pre-trained language models. The results show that pre-trained language models can effectively improve extraction performance, with ELECTRA performing best in the in-domain setting and BERT performing best in the cross-domain setting. We also find that adding a small amount of labeled target-domain data effectively improves cross-domain attribute extraction and enhances the domain adaptability of the model.
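
The abstract only sketches the distant-supervision labeling step. Below is a minimal illustration, in Python, of how attribute values taken from a structured product source might be projected onto unstructured text as character-level BIO labels for sequence labeling; the function name, knowledge-base format, and example strings are assumptions for illustration and are not taken from the paper's dataset.

# Hypothetical distant-supervision labeler: project known attribute values
# from a product knowledge base onto raw text as character-level BIO tags.
def bio_tags(text, attribute_values):
    """Return (character, tag) pairs for every known attribute value found in text.

    attribute_values maps an attribute name (e.g. "颜色") to a list of value
    strings taken from a structured product database.
    """
    tags = ["O"] * len(text)
    for attr, values in attribute_values.items():
        for value in values:
            if not value:
                continue
            start = text.find(value)
            while start != -1:
                end = start + len(value)
                # Label a span only if it does not overlap an earlier match.
                if all(tag == "O" for tag in tags[start:end]):
                    tags[start] = "B-" + attr
                    for i in range(start + 1, end):
                        tags[i] = "I-" + attr
                start = text.find(value, start + 1)
    return list(zip(text, tags))

if __name__ == "__main__":
    # Illustrative product title and attribute dictionary (not from the paper).
    title = "小米10青春版 5G手机 蓝色 8GB+128GB"
    knowledge_base = {"品牌": ["小米"], "颜色": ["蓝色"], "版本": ["青春版"]}
    for char, tag in bio_tags(title, knowledge_base):
        print(char, tag)

Tag sequences produced by such string matching are inevitably noisy, which is one reason evaluation relies on a manually annotated test set; the automatically labeled sequences are then used to fine-tune a pre-trained encoder such as BERT or ELECTRA with a token-classification layer, the setting compared in the paper's in-domain and cross-domain experiments.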

Key words

attribute extraction / distant supervision / pre-trained language model / domain adaptation

Cite this article

ZHANG Shiqi, MA Jin, ZHOU Xiabing, JIA Hao, CHEN Wenliang, ZHANG Min. Pre-trained Language Models for Product Attribute Extraction. Journal of Chinese Information Processing. 2022, 36(1): 56-64

Funding

National Natural Science Foundation of China (61876115)