Abstract
Automatic sentence segmentation is of significant practical value in natural language processing, serving as an important preprocessing step for tasks such as machine translation, syntactic parsing, and semantic analysis. Existing Tibetan sentence segmentation methods, whether dictionary-based or combining a dictionary with a statistical model, are hampered by the part-of-speech ambiguity of sentence-final words and by data sparsity, and are therefore less effective. This paper proposes an automatic Tibetan sentence segmentation method based on Bi-LSTM and Self-Attention. In comparative experiments, the method achieved a macro accuracy of 97.7%, a macro recall of 98.06%, and a macro F1 of 97.88%, outperforming all compared methods. The experiments also show that the model performs better when sequences are fixed to a uniform length by front-end truncation and padding rather than back-end truncation and padding, and that Skip-gram syllable embeddings outperform both CBOW-based and randomly initialized ones.
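The reported advantage of front-end over back-end length fixing concerns how a token sequence is cut or padded to a uniform length before being fed to the model. A minimal sketch of the two schemes (function and parameter names are illustrative, not from the paper):

```python
def fix_length_front(seq, n, pad=0):
    """Fix a token-id sequence to length n at the FRONT:
    truncation drops the earliest tokens (keeping the last n),
    padding prepends pad ids."""
    if len(seq) >= n:
        return seq[-n:]
    return [pad] * (n - len(seq)) + seq


def fix_length_back(seq, n, pad=0):
    """Fix length at the BACK: truncation keeps the first n tokens,
    padding appends pad ids."""
    if len(seq) >= n:
        return seq[:n]
    return seq + [pad] * (n - len(seq))
```

Intuitively, front-end truncation preserves the tokens closest to the candidate sentence boundary, which is plausibly why it helps a boundary classifier.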
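The comparison between Skip-gram and CBOW syllable embeddings rests on how the two word2vec variants frame prediction: Skip-gram predicts each context token from the center token, while CBOW predicts the center token from its whole context window. A minimal sketch of how the training pairs differ (illustrative code, not the paper's implementation):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (center, single context token) is one training pair."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """CBOW: the full context window jointly predicts the center token."""
    pairs = []
    for i, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, i - window),
                              min(len(tokens), i + window + 1))
               if j != i]
        if ctx:
            pairs.append((ctx, center))
    return pairs
```

Because Skip-gram generates one training pair per context token rather than averaging the context, it is often reported to learn better representations for rare units, which may matter for Tibetan syllables.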
Key words
Tibetan sentence /
sentence segmentation /
TASSM_BS model
Funding
National Social Science Fund of China (21VJXT013); National Natural Science Foundation of China (62066042); Tibet University Cultivation Fund (ZDCZJH19-19, ZDCZJH19-20, ZDCZJH18-16); Natural Science Foundation of Tibet Autonomous Region (XZ202101ZR0108G); Tibet University Funded Project for In-service Doctoral Study (藏财预指〔2022〕1号); Qinghai Normal University Young and Middle-aged Researchers Fund (2019zr013)