Abstract
Tibetan word segmentation is a fundamental task in Tibetan natural language processing, and its performance affects downstream applications such as automatic summarization, text classification, and search engines. Tibetan word segmentation methods based on word-position tagging usually use a four-word-position tag set. To extract feature information more comprehensively and capture deeper semantic information, this paper proposes an eight-word-position tag set and combines it with the BiLSTM_CRF model to obtain a BiLSTM_CRF Tibetan word segmentation method based on eight-word-position tags. Experimental results show that the method segments well, achieving a precision of 95.07%, a recall of 95.57%, and an F1 score of 95.32% on the test data set.
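As a point of reference for the four-word-position baseline mentioned above, the sketch below labels each tagging unit with the conventional B/M/E/S tags (begin, middle, end, single) and checks that the reported F1 is the harmonic mean of the other two figures. It is an illustration under stated assumptions, not the paper's implementation: the function names and the placeholder units `u1` to `u6` are hypothetical, the tagging unit is assumed to be the syllable, the first reported figure is treated as precision, and the eight-tag set itself is not shown because the abstract does not define it.

```python
from typing import List


def four_tag_labels(words: List[List[str]]) -> List[str]:
    """Label every unit with B/M/E/S according to its position in its word.

    `words` is a segmented sentence: each word is a list of tagging units
    (assumed here to be syllables; the units below are placeholders).
    """
    tags: List[str] = []
    for units in words:
        if not units:
            continue  # skip empty words defensively
        if len(units) == 1:
            tags.append("S")  # single-unit word
        else:
            # first unit is B, last is E, everything in between is M
            tags.extend(["B"] + ["M"] * (len(units) - 2) + ["E"])
    return tags


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical segmented sentence: a 1-unit, a 2-unit, and a 3-unit word.
segmented = [["u1"], ["u2", "u3"], ["u4", "u5", "u6"]]
print(four_tag_labels(segmented))         # ['S', 'B', 'E', 'B', 'M', 'E']
print(round(f_measure(95.07, 95.57), 2))  # 95.32
```

The second printed value reproduces the reported 95.32%, which is consistent with reading the three figures as precision, recall, and F1.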
Key words
natural language processing (NLP) /
Tibetan word segmentation /
BiLSTM_CRF /
eight-word-position tag
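Funding
National Natural Science Foundation of China (61966031, 61866032); Qinghai Provincial Department of Science and Technology funded projects (2019-SF-129, 2021-ZJ-727); Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation (2020-ZJ-Y05); Key Laboratory of Tibetan Information Processing, Ministry of Education (2013-Z-Y17, 2014-Z-Y32, 2015-Z-Y03)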