Chinese Named Entity Recognition with Few Labeled Data

ZHANG Yun, HUANG Cheng, ZHANG Yuyao, HUANG Jingwei, ZHANG Yude, HUANG Liya, LIU Yan, DING Keke, WANG Xiumei

Journal of Chinese Information Processing, 2023, Vol. 37, Issue 3: 101-111.
Information Extraction and Text Mining


Abstract

The lack of training data is a typical problem in current named entity recognition. Entity triggers can improve a model's cost-effectiveness, but such triggers require extensive manual annotation and have so far been studied only for English text, leaving other languages largely unexplored. To address the high trigger-annotation cost and limited applicability of the existing TMN model, this paper proposes a new automatic trigger annotation method and an accompanying model, GLDM-TMN. The model not only eliminates manual annotation but also introduces the Mogrifier LSTM structure, the Dice loss function, and several attention mechanisms to improve both trigger matching accuracy and entity labeling accuracy. Experiments on two public datasets show that, with the same training data, the F1 score of GLDM-TMN exceeds that of the TMN model by 0.0133 on the Resume NER dataset and by 0.034 on the Weibo NER dataset. Moreover, GLDM-TMN trained on only 20% of the training data outperforms a BiLSTM-CRF model trained on 40%.
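The Mogrifier LSTM cited in the abstract (Melis et al., ICLR 2020) interleaves mutual gating between the current input and the previous hidden state before the standard LSTM cell update. The paper's exact configuration is not given here; the following is a minimal NumPy sketch of that gating step, with the weight matrices `Q`, `R` and the round count as assumed hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrify(x, h, Q, R, rounds=5):
    """Mogrifier gating step: input x and previous hidden state h
    repeatedly modulate each other before the LSTM cell runs.
    Q, R are learned projection matrices (random here for illustration)."""
    for i in range(1, rounds + 1):
        if i % 2 == 1:           # odd round: h gates x
            x = 2.0 * sigmoid(Q @ h) * x
        else:                    # even round: x gates h
            h = 2.0 * sigmoid(R @ x) * h
    return x, h

rng = np.random.default_rng(0)
d = 4
x, h = rng.standard_normal(d), rng.standard_normal(d)
Q, R = rng.standard_normal((d, d)), rng.standard_normal((d, d))
x2, h2 = mogrify(x, h, Q, R)
print(x2.shape, h2.shape)  # (4,) (4,)
```

In the full model, each mogrified pair (x, h) then feeds an ordinary LSTM cell; the Mogrifier paper reports small odd round counts (around 5) working well.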

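The Dice loss referenced in the abstract targets the class imbalance typical of NER, where the "O" tag dominates the sequence. The paper's exact formulation is not reproduced here; this is a minimal sketch of a smoothed soft-Dice loss over per-token probabilities, with the smoothing constant `gamma` as an assumed hyperparameter:

```python
import numpy as np

def dice_loss(probs, targets, gamma=1.0):
    """Smoothed soft-Dice loss over a token sequence.

    probs:   predicted probability of the gold class per token, shape (n,)
    targets: 1.0 at the gold class per token (soft labels also work)
    gamma:   smoothing constant (assumed hyperparameter)
    """
    inter = 2.0 * np.sum(probs * targets) + gamma
    union = np.sum(probs ** 2) + np.sum(targets ** 2) + gamma
    return 1.0 - inter / union

# Perfect predictions give zero loss; confident misses are penalized.
print(dice_loss(np.ones(5), np.ones(5)))                 # 0.0
print(round(dice_loss(np.zeros(5), np.ones(5)), 3))      # 0.833
```

Unlike cross-entropy, which sums a term for every token and is therefore dominated by the plentiful "O" tags, the Dice formulation scores overlap between predictions and gold labels as a ratio, so rare entity tokens weigh in proportionally.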

Keywords

Chinese named entity recognition / entity triggers / Mogrifier LSTM structure / joint loss function / Dice loss function / attention mechanism

Cite This Article

ZHANG Yun, HUANG Cheng, ZHANG Yuyao, HUANG Jingwei, ZHANG Yude, HUANG Liya, LIU Yan, DING Keke, WANG Xiumei. Chinese Named Entity Recognition with Few Labeled Data. Journal of Chinese Information Processing, 2023, 37(3): 101-111.

References

[1] ZHANG X Y, WANG T, CHEN H W. Research on named entity recognition[J]. Computer Science, 2005, 32(4): 44-48.
[2] AN Y, XIA X, CHEN X, et al. Chinese clinical named entity recognition via multi-head self-attention based BiLSTM-CRF[J]. Artificial Intelligence in Medicine, 2022, 127: 102282.
[3] LIN B Y, LEE D H, SHEN M, et al. TriggerNER: Learning with entity triggers as explanations for named entity recognition[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 8503-8511.
[4] ZENG X, LI Y, ZHAI Y, et al. Counterfactual generator: A weakly-supervised method for named entity recognition[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020: 7270-7280.
[5] SAK H, SENIOR A, BEAUFAYS F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition[J]. Computer Science, 2014: 338-342.
[6] LUONG M T, PHAM H, MANNING C D. Effective approaches to attention-based neural machine translation[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015: 1412-1421.
[7] SHANNON C E. A mathematical theory of communication[J]. Bell System Technical Journal, 1948,27(3): 379-423.
[8] DICE L R. Measures of the amount of ecologic association between species[J]. Ecology, 1945, 26(3): 297-302.
[9] MA X, HOVY E. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016: 1064-1074.
[10] YUN H, HWANG Y, JUNG K. Improving context-aware neural machine translation using self-attentive sentence embedding[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 9498-9506.
[11] ZHANG Y, YANG J. Chinese NER using lattice LSTM[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 1554-1564.
[12] PENG N, DREDZE M. Improving named entity recognition for Chinese social media with word segmentation representation learning[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016: 149-155.
[13] TIAN Y, SHEN W, SONG Y, et al. Improving biomedical named entity recognition with syntactic information[J]. BMC Bioinformatics, 2020, 21(1): 1-17.
[14] LIU W, FU X, ZHANG Y, et al. Lexicon enhanced Chinese sequence labeling using BERT adapter[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021: 5847-5858.
[15] MILLER A, FISCH A, DODGE J, et al. Key-value memory networks for directly reading documents[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016: 1400-1409.
[16] NIE Y, TIAN Y, SONG Y, et al. Improving named entity recognition with attentive ensemble of syntactic information[G]//Findings of the Association for Computational Linguistics: EMNLP 2020, 2020: 4231-4245.

Funding

National Natural Science Foundation of China (61977039)