A Multi-task Lao Part-of-Speech Tagging Method Fusing Structural Features of Word

WANG Xingjin, ZHOU Lanjiang, ZHANG Jianan, ZHOU Feng

Journal of Chinese Information Processing, 2019, Vol. 33, Issue 11: 39-45.
Language Analysis and Computation

Abstract

Research on Lao part-of-speech tagging is still at an early stage, and little annotated corpus data is available. Lao has also absorbed many loanwords from other languages, so the annotated corpus contains a large number of rare words. Multi-task learning is one effective way to recognize rare words. This paper studies the structural features of Lao words and builds a multi-task Lao part-of-speech tagging model that combines the part-of-speech tagging loss with an auxiliary main-consonant loss. Many Lao affixes carry part-of-speech information, so the model also uses character-level word vectors to capture these affix cues. Because Lao sentences tend to be long, the model uses an attention mechanism to prevent the loss of long-range context features. Experimental results show that, with limited annotated data, the model reaches a part-of-speech tagging accuracy of 93.24%, outperforming the other methods compared.
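
The abstract describes training on a part-of-speech tagging loss together with an auxiliary main-consonant loss, but it does not give the formula. The sketch below shows one common way such a multi-task objective could be written: a weighted sum of two per-token cross-entropy terms. The function names, the cross-entropy form, and the `aux_weight` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, gold_index):
    """Negative log-probability assigned to the gold label."""
    return -math.log(softmax(scores)[gold_index])


def multitask_loss(pos_scores, pos_gold, aux_scores, aux_gold, aux_weight=0.5):
    """Sentence-level loss: POS loss plus a weighted auxiliary loss.

    pos_scores / aux_scores: one list of label scores per token.
    pos_gold / aux_gold: gold label indices per token.
    aux_weight: hypothetical weight on the auxiliary (main-consonant) task.
    """
    pos_loss = sum(cross_entropy(s, g) for s, g in zip(pos_scores, pos_gold))
    aux_loss = sum(cross_entropy(s, g) for s, g in zip(aux_scores, aux_gold))
    return pos_loss + aux_weight * aux_loss
```

Setting `aux_weight` to 0 reduces this to ordinary single-task tagging, which makes it easy to check how much the auxiliary signal contributes in an experiment of this kind.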

Key words

Lao part-of-speech tagging / rare words / main consonant auxiliary loss / attention mechanism

Cite this article

WANG Xingjin, ZHOU Lanjiang, ZHANG Jianan, ZHOU Feng. A Multi-task Lao Part-of-Speech Tagging Method Fusing Structural Features of Word. Journal of Chinese Information Processing. 2019, 33(11): 39-45

Funding

National Natural Science Foundation of China (61662040, 61562049); Natural Science Foundation of Yunnan Province (2016FB101)