Cross-border National Cultural Entity Recognition Method with Word Set Information

YANG Zhenping 1,2, MAO Cunli 1,2, LEI Xiongli 2,3, GAO Shengxiang 1,2, LU Shan 1,2, ZHANG Yongbing 1,2

Journal of Chinese Information Processing, 2022, Vol. 36, Issue (10): 88-96.
Information Processing of Minority, Cross-border and Neighboring Languages

Abstract

Entities in the cross-border national culture domain are usually composed of domain words that describe national cultural characteristics, so mainstream character-based entity recognition methods suffer from ambiguous domain entity boundaries and therefore produce recognition errors. To address this, this paper proposes a cross-border national cultural entity recognition method that incorporates word set information obtained from a domain lexicon, using the word sets to enhance the word boundary and word semantic information of domain entities. First, a cross-border national cultural domain lexicon is constructed to obtain word set information. Second, the weights among the word set vectors are computed through a word set attention mechanism, and positional encoding is incorporated to strengthen the positional information of the word sets. Finally, the word set information is fused into the feature extraction layer to enhance domain entity boundary information and alleviate the loss of word semantics caused by using only character features. Experimental results show that the proposed method improves the F1 score by 2.71% over the baseline on a cross-border national cultural text dataset.
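The abstract outlines a three-step pipeline: build a domain lexicon, attend over the word set matched for each character (with positional encoding), and fuse the resulting word set vector into the character features. Below is a minimal PyTorch-style sketch of what such a word-set attention fusion layer could look like; it is an illustration under assumptions, not the authors' implementation, and every name (WordSetFusion, word_set_ids, word_set_mask, max_words) is hypothetical. It assumes the word sets have already been matched per character against the domain lexicon and padded to a fixed size.

# Minimal sketch (assumed design, not the paper's code): for each character, the lexicon
# words covering it are embedded, given positional information, weighted by attention
# with the character representation as the query, pooled into a word-set vector, and
# concatenated with the character features before the downstream feature extractor.
import math
import torch
import torch.nn as nn

class WordSetFusion(nn.Module):
    """Hypothetical word-set attention fusion layer."""
    def __init__(self, char_dim: int, word_dim: int, vocab_size: int, max_words: int = 8):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.pos_emb = nn.Embedding(max_words, word_dim)          # position inside the word set
        self.query = nn.Linear(char_dim, word_dim)                # character acts as query
        self.out = nn.Linear(char_dim + word_dim, char_dim)

    def forward(self, char_repr, word_set_ids, word_set_mask):
        # char_repr:     (batch, seq_len, char_dim)   character features (e.g., BERT output)
        # word_set_ids:  (batch, seq_len, max_words)  lexicon words matched for each character
        # word_set_mask: (batch, seq_len, max_words)  1 for real words, 0 for padding
        B, L, K = word_set_ids.shape
        words = self.word_emb(word_set_ids)                       # (B, L, K, word_dim)
        pos = self.pos_emb(torch.arange(K, device=words.device))  # (K, word_dim)
        words = words + pos                                       # add word-set positional information

        q = self.query(char_repr).unsqueeze(2)                    # (B, L, 1, word_dim)
        scores = (q * words).sum(-1) / math.sqrt(words.size(-1))  # (B, L, K)
        scores = scores.masked_fill(word_set_mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        attn = torch.nan_to_num(attn)                             # characters with empty word sets -> zeros
        word_set_vec = (attn.unsqueeze(-1) * words).sum(2)        # (B, L, word_dim)

        fused = torch.cat([char_repr, word_set_vec], dim=-1)      # enrich character features
        return self.out(fused)                                    # fed to the feature extraction layer

In this sketch, word_set_ids would be produced offline by matching each character's covering n-grams against the domain lexicon, and the fused output would then pass through a sequence encoder and CRF decoder as is typical for NER; the actual word-set construction and feature extraction layer used in the paper may differ.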

Keywords

cross-border national culture / entity recognition / word set information / domain lexicon / attention mechanism

Cite this article

YANG Zhenping, MAO Cunli, LEI Xiongli, GAO Shengxiang, LU Shan, ZHANG Yongbing. Cross-border National Cultural Entity Recognition Method with Word Set Information. Journal of Chinese Information Processing. 2022, 36(10): 88-96

Funding

National Natural Science Foundation of China (61732005, 61866019, 61761026, 61972186); Key Project of the Yunnan Applied Basic Research Program (2019FA023); Digitalization Research and Application Demonstration of Yunnan Characteristic Industries (202002AD080001); Yunnan Reserve Talent Project for Young and Middle-aged Academic and Technical Leaders (2019HB006)
