Abstract
Word embedding representation is a foundational task in machine learning: its goal is to represent words as optimized vectors so that computers can better understand natural language. With the development of neural network techniques, word embeddings have come to play an important role in natural language processing. Research on Tibetan word embedding representation is of great significance for analyzing the features of Tibetan and for processing Tibetan with deep learning techniques. This paper proposes a Tibetan word embedding representation method that jointly trains components, characters, and words as multiple primitives, designs the corresponding multi-primitive joint training model TCCWE, and verifies its effectiveness with an intrinsic word similarity/relatedness evaluation. Experiments show that the proposed method is effective, improving performance by 3.35% on TWordSim215 and by 4.36% on TWordRel215.
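The abstract describes composing word representations from three granularities of primitive (components, characters, and words). The sketch below illustrates one plausible reading of that idea only; the vocabularies, the simple averaging composition, and the dimensionality are all invented for illustration and are not the paper's actual TCCWE architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # embedding dimensionality (illustrative choice)

# Hypothetical toy vocabularies: words, their characters, and the
# components that make up those characters.
word_vocab = {"w1": 0, "w2": 1}
char_vocab = {"c1": 0, "c2": 1, "c3": 2}
comp_vocab = {"p1": 0, "p2": 1, "p3": 2, "p4": 3}

# One embedding table per primitive type, trained jointly in the
# multi-primitive setting.
W = rng.normal(scale=0.1, size=(len(word_vocab), DIM))  # word embeddings
C = rng.normal(scale=0.1, size=(len(char_vocab), DIM))  # character embeddings
P = rng.normal(scale=0.1, size=(len(comp_vocab), DIM))  # component embeddings

def multi_primitive_vector(word, chars, comps):
    """Compose one vector from word-, character-, and component-level
    embeddings by simple averaging -- one plausible composition; the
    paper may instead weight or concatenate the three levels."""
    v_word = W[word_vocab[word]]
    v_char = C[[char_vocab[c] for c in chars]].mean(axis=0)
    v_comp = P[[comp_vocab[p] for p in comps]].mean(axis=0)
    return (v_word + v_char + v_comp) / 3.0

v = multi_primitive_vector("w1", ["c1", "c2"], ["p1", "p2", "p3"])
print(v.shape)  # → (50,)
```

In such a scheme, the composed vector (rather than the word vector alone) predicts the training target, so gradient updates flow to all three embedding tables at once, which is what makes the training "joint".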
Keywords
natural language processing /
Tibetan /
neural network /
word embedding representation
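The intrinsic word similarity/relatedness evaluation mentioned in the abstract is conventionally scored as the rank correlation (Spearman's ρ) between human ratings and the cosine similarities of the learned vectors. The sketch below shows that scoring procedure on an invented toy test set; the real evaluation uses benchmarks such as TWordSim215 and TWordRel215, whose entries are not reproduced here:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors
    (no tie handling, which suffices for this toy example)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / (np.linalg.norm(ra) * np.linalg.norm(rb)))

# Invented stand-ins for learned embeddings and a similarity test set
# of (word1, word2, human score) triples.
vecs = {"a": np.array([1.0, 0.0]), "b": np.array([0.9, 0.1]),
        "c": np.array([0.0, 1.0]), "d": np.array([0.2, 0.9])}
pairs = [("a", "b", 9.0), ("a", "c", 1.0), ("c", "d", 8.0), ("b", "d", 2.0)]

human = np.array([s for _, _, s in pairs])
model = np.array([cosine(vecs[w1], vecs[w2]) for w1, w2, _ in pairs])
score = spearman(human, model)
print(round(score, 3))
```

The reported 3.35% and 4.36% gains would then be differences in this correlation score between TCCWE and a baseline embedding on the two benchmarks.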
Funding
National Natural Science Foundation of China (61866032, 61966031, 61163018, 61262051); National Social Science Fund of China (13BYY141, 16BYY167); Ministry of Education "Chunhui Plan" Cooperative Research Projects (Z2012093, Z2016077); Qinghai Province Basic Research Program (2017-ZJ-767, 2019-SF-129, 2017-GX-146); Program for Changjiang Scholars and Innovative Research Teams in Universities (IRT1068); Qinghai Province Key Laboratory Program (2013-Z-Y17, 2014-Z-Y32, 2015-Z-Y03); Key Laboratory of Tibetan Information Processing and Machine Translation (2013-Y-17)