Abstract
Part-of-speech (POS) tagging is a fundamental task in natural language processing. Lao POS tagging is hampered by scarce corpora, complex word forms, a large number of low-frequency and out-of-vocabulary (OOV) words, and long sentences in which important information is easily lost during propagation. This paper therefore proposes a multi-granularity feature-fusion approach to Lao POS tagging, building a Transformer-CRF model that integrates Lao word, character, and syllable features. First, character- and syllable-level feature vectors are fused with conventional word vectors, so that the model exploits the corpus at three levels of granularity. Second, a Transformer encoder extracts long-range contextual information from Lao sentences, mitigating the loss of important information. Finally, a CRF layer captures the constraints between adjacent POS tags to produce the optimal tag sequence. Experimental results show that, with a limited corpus, the model outperforms other mainstream models, achieving a precision of 94.76%, a recall of 93.93%, and an F1 score of 94.34%.
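The pipeline in the abstract (fuse word-, character-, and syllable-level vectors; score tags per token; let a CRF pick the best tag sequence via adjacent-tag constraints) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the tag set, scores, and function names are illustrative, and a simple emission-score table stands in for the Transformer encoder output.

```python
def fuse_features(word_vec, char_vec, syll_vec):
    """Multi-granularity fusion: concatenate word-, character-, and
    syllable-level feature vectors into one input representation."""
    return word_vec + char_vec + syll_vec  # list concatenation


def viterbi_decode(emissions, transitions, tags):
    """CRF-style decoding.
    emissions:   [T][K] per-token tag scores (here standing in for the
                 Transformer encoder's output layer).
    transitions: [K][K] scores for moving from tag j to tag k, encoding
                 the adjacent-POS constraints the CRF layer learns.
    Returns the highest-scoring tag sequence."""
    K = len(tags)
    score = list(emissions[0])  # best path score ending in each tag
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(K):
            best_prev = max(range(K), key=lambda j: score[j] + transitions[j][k])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k]
                             + emissions[t][k])
        score = new_score
        backpointers.append(ptr)
    # Backtrack from the best final tag.
    last = max(range(K), key=lambda k: score[k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return [tags[k] for k in reversed(path)]
```

With uniform transition scores the decoder simply follows the per-token emissions; a trained CRF would instead penalize implausible adjacent tag pairs, which is the constraint-extraction role described above.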
Keywords
multi-granularity / Lao / part-of-speech tagging / Transformer
Funding
National Natural Science Foundation of China (61662040)