Abstract
Named entity recognition (NER) is a key task in Tibetan natural language processing. This paper proposes a Cascade-BiLSTM-CRF architecture combined with three Tibetan pre-trained models (Word2Vec, ELMo, ALBERT). The cascade technique splits Tibetan NER into two sub-tasks, entity boundary delineation and entity class determination, which are handled in separate stages and simplify the model structure; the Tibetan pre-trained models help the model learn prior knowledge of Tibetan. Experiments show that the Cascade-BiLSTM-CRF model reduces the training time per epoch by 28.30% compared with the BiLSTM-CRF model, and that combining the cascade technique with pre-training achieves better recognition results while also shortening training time.
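To illustrate the two-stage decomposition described in the abstract, the following minimal sketch splits a typed tag sequence into boundary-only tags (the input to a boundary-delineation stage) and typed spans (the targets of a class-determination stage). The BIO tagging scheme, the function name `split_cascade_labels`, and the example entity types are assumptions made for illustration; the abstract does not specify the paper's actual label scheme or implementation.

```python
from typing import List, Tuple


def split_cascade_labels(
    tags: List[str],
) -> Tuple[List[str], List[Tuple[int, int, str]]]:
    """Split typed BIO tags into boundary tags and typed spans.

    Hypothetical helper: assumes tags look like "B-PER", "I-PER", or "O".
    Returns (boundary_tags, spans), where each span is (start, end, type)
    with an exclusive end index.
    """
    # Stage 1 targets: keep only the boundary part of each tag.
    boundary = [t.split("-", 1)[0] for t in tags]  # "B-PER" -> "B", "O" -> "O"

    # Stage 2 targets: one entity type per detected span.
    spans: List[Tuple[int, int, str]] = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # A tag continues the open span only if it is an I- tag of the same type.
        inside = tag.startswith("I-") and start is not None and tag[2:] == etype
        if not inside:
            if start is not None:
                spans.append((start, i, etype))  # close the open span
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]  # open a new span
            # A stray I- tag with no open span is ignored for span extraction.
    if start is not None:
        spans.append((start, len(tags), etype))
    return boundary, spans


example_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
boundary_tags, typed_spans = split_cascade_labels(example_tags)
```

Under these assumptions, the first stage only has to predict three boundary labels regardless of how many entity types exist, which is one plausible reading of how the cascade simplifies the model.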
Key words
Tibetan NER /
cascade /
pre-training
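Funding
Tibet University Improvement Plan Project (ZDTSJH21-07); National Natural Science Foundation of China (62066042); Humanities and Social Sciences Research Project of the Ministry of Education (21YJCZH059); Tibet University Cultivation Plan Project (ZDCZJH21-10); 2021 Humanities and Social Sciences Research Project of Universities in the Tibet Autonomous Region (SK2021-24); Tibet University Everest Discipline Construction Plan Project (zf22002001)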