Entity Relation Extraction Based on Pre-trained Language Model for Tibetan Medicine

ZHOU Qing1,2,3, YONG Tso1,2,3, LAMAO Dongzhi1,2,3, NYIMA Trashi1,2,3

Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (8): 76-83.
Information Processing of Minority, Cross-border and Neighboring Languages


Abstract

Texts in the field of Tibetan medicine are stored mainly in unstructured form, so information extraction from these texts plays an important role in mining Tibetan medical knowledge. To address the poor semantic representation and the low accuracy on nested entities of existing Tibetan entity relation extraction models, this paper presents an entity relation extraction method based on a pre-trained model: the TibetanAI_ALBERT_v2.0 pre-trained language model is used so that the model recognizes entities better, and a span-based method is used to handle nested entities. On top of Dropout, a KL-divergence loss term is added to improve the model's generalization. Experiments on the TibetanAI_TMIE_v1.0 Tibetan medicine dataset show a precision of 84.5%, a recall of 80.1%, and an F1 score of 82.2%, an improvement of 4.4 percentage points in F1 over the baseline, demonstrating the effectiveness of the proposed method.
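As a consistency check, the reported F1 follows from the precision and recall as their harmonic mean: 2 × 84.5 × 80.1 / (84.5 + 80.1) ≈ 82.2.

The abstract names two techniques: span-based extraction for nested entities, and a symmetric KL-divergence term added on top of Dropout (known in the literature as R-Drop). The PyTorch sketch below shows how such pieces typically fit together; it is an illustration under assumptions, not the paper's implementation. The hidden size (312), span-length limit, number of entity types, module names, and the weight alpha are all invented for the example, and only the entity side is shown; relation classification over span pairs would follow the same pattern.

```python
# Minimal sketch of the two ideas (assumed details, not the paper's code):
# (1) span-based entity recognition: every candidate span is scored
#     independently, so nested (overlapping) entities can all be predicted;
# (2) R-Drop-style regularization: a symmetric KL-divergence term between
#     two dropout forward passes of the same batch is added to the loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


def enumerate_spans(seq_len: int, max_span_len: int = 8):
    """All (start, end) candidate spans up to max_span_len tokens, end exclusive."""
    return [(i, j)
            for i in range(seq_len)
            for j in range(i + 1, min(i + max_span_len, seq_len) + 1)]


class SpanClassifier(nn.Module):
    """Classifies each candidate span from pooled token representations."""

    def __init__(self, hidden: int, num_entity_types: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # Features: start token + end token + mean-pooled span representation.
        self.ffn = nn.Linear(3 * hidden, num_entity_types + 1)  # +1: "no entity"

    def forward(self, token_reprs, spans):
        feats = torch.stack([
            torch.cat([token_reprs[i], token_reprs[j - 1],
                       token_reprs[i:j].mean(dim=0)])
            for i, j in spans])
        return self.ffn(self.dropout(feats))


def symmetric_kl(logits_a, logits_b):
    """Bidirectional KL between the distributions of two dropout passes."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    return 0.5 * (F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
                  + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean"))


# Toy usage; in the paper's setting token_reprs would come from the
# TibetanAI_ALBERT_v2.0 encoder (hidden size 312 is an assumption here).
token_reprs = torch.randn(16, 312)                   # 16 tokens
spans = enumerate_spans(seq_len=16)
model = SpanClassifier(hidden=312, num_entity_types=5)
labels = torch.zeros(len(spans), dtype=torch.long)   # dummy gold labels

logits1 = model(token_reprs, spans)                  # two stochastic passes:
logits2 = model(token_reprs, spans)                  # dropout differs each time
alpha = 4.0                                          # KL weight (hyperparameter)
loss = (0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))
        + alpha * symmetric_kl(logits1, logits2))
loss.backward()
```

Because each span is scored independently, an entity nested inside another can be predicted alongside it, which sequence labeling with a single tag per token cannot do; the KL term penalizes disagreement between the two dropout passes, narrowing the gap between training (dropout on) and inference (dropout off).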

Key words

Tibetan medicine / entity relation extraction / pre-trained language model

Cite this article

ZHOU Qing, YONG Tso, LAMAO Dongzhi, NYIMA Trashi. Entity Relation Extraction Based on Pre-trained Language Model for Tibetan Medicine. Journal of Chinese Information Processing. 2024, 38(8): 76-83


Funding

Science and Technology Department of Tibet Autonomous Region Project (XZ202401JD0010); Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0116100)