Multi-Granularity Fusion for Korean Named Entity Recognition

HUANG Zhenghao, JIN Guangzhu, GAO Junlong

Journal of Chinese Information Processing, 2023, Vol. 37, Issue 8: 66-74.
Section: Information Processing for Ethnic, Cross-Border, and Neighboring Languages


Abstract

Starting from the grammar and compositional structure of Korean, this paper studies effective representations of Korean entities at three granularities, namely jamo (phoneme), syllable, and morpheme, and proposes a multi-granularity fusion method for Korean named entity recognition (NER). The method fuses multi-granularity features by considering both the connections and the differences between granularities. First, jamo-level features are encoded, and a CNN-based model fuses the jamo and syllable granularities to produce syllable vectors. Second, the fastText pre-trained model encodes the resulting syllable vectors to capture their sequential features. In parallel, the KLUE-BERT pre-trained model is used to model morpheme-level features and obtain morpheme vectors. Finally, the syllable vectors and morpheme vectors are fused into a text representation that contains multi-granularity features, and TENER, a Transformer-based NER model, performs the recognition. Experiments on the Klpexpo 2016 and KLUE-NER corpora show that the proposed representations and fusion method extract Korean entity features effectively, achieving F1 scores of 89.45% on Klpexpo 2016 and 88.82% on KLUE-NER.
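The jamo-level step in the abstract depends on splitting each Hangul syllable into its component jamo. The abstract does not say how the authors do this, so the following is only a minimal sketch based on the standard Unicode arithmetic for precomposed Hangul syllables; the function names `decompose_syllable` and `decompose_text` are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: jamo decomposition via Unicode code point arithmetic.
# Precomposed Hangul syllables occupy U+AC00..U+D7A3 and are laid out as
#   code = 0xAC00 + (lead * 21 + vowel) * 28 + tail
HANGUL_BASE = 0xAC00
NUM_LEADS, NUM_VOWELS, NUM_TAILS = 19, 21, 28  # tail index 0 means "no final consonant"

LEADS = [chr(0x1100 + i) for i in range(NUM_LEADS)]         # initial consonants
VOWELS = [chr(0x1161 + i) for i in range(NUM_VOWELS)]       # medial vowels
TAILS = [""] + [chr(0x11A8 + i) for i in range(NUM_TAILS - 1)]  # final consonants


def decompose_syllable(ch):
    """Return the jamo of one Hangul syllable as a tuple.

    Characters outside the precomposed syllable block are returned unchanged
    as a one-element tuple.
    """
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < NUM_LEADS * NUM_VOWELS * NUM_TAILS:
        return (ch,)
    lead, rest = divmod(offset, NUM_VOWELS * NUM_TAILS)
    vowel, tail = divmod(rest, NUM_TAILS)
    parts = (LEADS[lead], VOWELS[vowel], TAILS[tail])
    return tuple(p for p in parts if p)  # drop the empty "no final" slot


def decompose_text(text):
    """Decompose every character of a string, keeping syllable boundaries."""
    return [decompose_syllable(ch) for ch in text]
```

Keeping one tuple per syllable preserves the syllable boundaries, which a jamo-to-syllable encoder (such as the CNN the abstract mentions) would need in order to pool jamo features back into one vector per syllable.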

Key words

Korean / named entity recognition (NER) / multi-granularity fusion / pre-trained model

Cite this article

HUANG Zhenghao, JIN Guangzhu, GAO Junlong. Multi-Granularity Fusion for Korean Named Entity Recognition. Journal of Chinese Information Processing. 2023, 37(8): 66-74


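Funding

National Philosophy and Social Science Fund of China (18ZDA306); Yanbian University Key Research Project for the Construction of a World-Class Discipline in Foreign Languages and Literature (18YLGG01)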