Abstract
Laotian is an alphabetic language written without spaces between words, so word segmentation is a prerequisite for natural language processing. Existing Laotian segmentation algorithms mainly segment syllables with rules first and then segment words based on the syllable results, which leads to problems such as error propagation. This paper proposes an end-to-end Laotian word segmentation method based on neural networks. Following the idea of multi-task joint learning, it combines Laotian syllable segmentation and word segmentation in an end-to-end model built on a bidirectional long short-term memory recurrent neural network (BiLSTM). Experiments show that the precision of the end-to-end model reaches 89.02%, an improvement over previous word segmentation models.
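To make the joint setup concrete, the following is a minimal sketch of how one sentence could supply supervision for both tasks at once. It is an illustration only: the B/I tagging scheme, the helper names `joint_labels` and `decode`, and the example phrase are assumptions, not details taken from the paper, and the BiLSTM itself is not shown.

```python
# Hypothetical illustration: derive two aligned label sequences
# (syllable boundaries and word boundaries) for the same characters.
# The B/I scheme is an assumption; the paper's actual tag set is not given here.

def joint_labels(words):
    """words: list of words, each word a list of syllable strings."""
    chars, syl_tags, word_tags = [], [], []
    for word in words:
        first_in_word = True
        for syllable in word:
            for i, ch in enumerate(syllable):
                chars.append(ch)
                # "B" marks the first character of a syllable, "I" the rest.
                syl_tags.append("B" if i == 0 else "I")
                # "B" marks the first character of a word, "I" the rest.
                word_tags.append("B" if first_in_word else "I")
                first_in_word = False
    return chars, syl_tags, word_tags


def decode(chars, tags):
    """Rebuild segments from characters and their B/I tags."""
    units = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not units:
            units.append(ch)
        else:
            units[-1] += ch
    return units


# Assumed example: two words, the first with two syllables.
sentence = [["ພາ", "ສາ"], ["ລາວ"]]
chars, syl_tags, word_tags = joint_labels(sentence)
print(decode(chars, syl_tags))
print(decode(chars, word_tags))
```

In a multi-task model of the kind the abstract describes, a shared encoder would read `chars` once and two output layers would predict `syl_tags` and `word_tags`, so syllable errors are not passed on as fixed input to the word segmenter.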
Key words
Laotian word segmentation /
syllable segmentation /
multi-task learning /
end-to-end model
Funding
National Natural Science Foundation of China (61662040, 61562049)