Construction of Uzbek Named Entity Dataset

AIZIHAIERJIANG Yusufu, JI Donghong, LI Fei, TENG Chong, Aizierguli

Journal of Chinese Information Processing, 2023, Vol. 37, Issue 9: 83-91.
Information Processing for Ethnic, Cross-Border, and Neighboring Languages

Abstract

Named entity recognition (NER) is an important task in natural language processing: it identifies entities in text and classifies them into predefined types. Research on Uzbek NER is still at an early stage both in China and abroad, and so far there is no public, general-purpose Uzbek NER dataset, which has limited progress on the task. This paper builds an NER dataset from Uzbek news text: we collect 500 Uzbek news articles and manually annotate the person, location, and organization names in them. We then run experiments and a comparative analysis on this dataset using mainstream deep learning models for NER. The results show that the F1 scores of the mainstream deep learning models are all above 90%, which demonstrates the validity and usability of the dataset. The work aims to advance research on Uzbek NER by providing a dataset and baseline models on which related work can build.
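
As a rough illustration of how an F1 score over person, location, and organization annotations can be computed, the sketch below scores predicted tags against gold tags at the entity level. The abstract does not state the tagging scheme or whether F1 is measured per entity or per token, so the BIO scheme, the label names `PER`, `LOC`, and `ORG`, the helper names `bio_to_spans` and `entity_f1`, and the example sentence are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); layout is an assumption
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence.

    Stray I- tags that do not continue an open span of the same type
    are ignored rather than repaired.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]],
              pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)      # a span counts only if boundaries and type match
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up five-token sentence; tokens and tags are illustrative only.
tokens = ["Anvar", "Karimov", "Toshkent", "shahrida", "ishlaydi"]
gold = [["B-PER", "I-PER", "B-LOC", "O", "O"]]
pred = [["B-PER", "I-PER", "O", "O", "O"]]  # the location is missed

precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case the prediction recovers the person name but misses the location, so precision is 1.0 while recall is 0.5. Exact-match scoring of this kind is stricter than token-level accuracy, because a span with a wrong boundary earns no credit.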

Key words

natural language processing / Uzbek language / named entity recognition

Cite this article

AIZIHAIERJIANG Yusufu, JI Donghong, LI Fei, TENG Chong, Aizierguli. Construction of Uzbek Named Entity Dataset. Journal of Chinese Information Processing. 2023, 37(9): 83-91

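Funding

National Natural Science Foundation of China (62176187, 61662081); National Key Research and Development Program of China (2017YFC1200500); Ministry of Education fund (18JZD015); Xinjiang Normal University Young Top-Notch Talent Program (XJNUQB2022-22)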