ZHAO Haoxin, YU Jingsong, LIN Jie. Design and Research on Chinese Word Embedding Model Based on Strokes[J]. Journal of Chinese Information Processing, 2019, 33(5): 17-23.
Design and Research on Chinese Word Embedding Model Based on Strokes
ZHAO Haoxin1, YU Jingsong1, LIN Jie2
1. School of Software and Microelectronics, Peking University, Beijing 102600, China; 2. School of Information, Renmin University of China, Beijing 100872, China
Abstract: Chinese characters have a complex two-dimensional structure that extends both horizontally and vertically. Most studies on Chinese word embeddings work at the character level without considering stroke sequences. This paper proposes a novel Stroke2Vec model that generates character embeddings from stroke sequences alone. The model extends the CBOW architecture of Word2Vec, replacing its embedding matrix with a CNN and an attention mechanism. Stroke2Vec aims to capture the structural rules of strokes in Chinese characters and to produce better character embeddings using only stroke sequences. Compared with Word2Vec and GloVe on an NER task, our model achieves an F1-score of 81.49%, outperforming Word2Vec by 1.21% and GloVe by 0.21%. Combining Stroke2Vec and Word2Vec embeddings further improves the F1-score to 81.55%.
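The abstract describes the architecture only at a high level, so the following is a minimal PyTorch sketch of a Stroke2Vec-style character encoder as outlined there: stroke embeddings passed through a CNN and pooled with attention, in place of CBOW's lookup matrix. The module names, dimensions, stroke alphabet, and pooling details are assumptions made for illustration, not the paper's published implementation.

```python
# Hypothetical sketch of a Stroke2Vec-style character encoder.
# All names, sizes, and the stroke alphabet below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_STROKES = 6   # 5 basic stroke types plus padding index 0 (assumed)
EMB_DIM = 100     # embedding dimensionality (assumed)

class StrokeEncoder(nn.Module):
    """Encode a character's stroke sequence into one vector:
    stroke embeddings -> 1-D CNN -> attention pooling."""
    def __init__(self, num_strokes=NUM_STROKES, emb_dim=EMB_DIM, kernel_size=3):
        super().__init__()
        self.stroke_emb = nn.Embedding(num_strokes, emb_dim, padding_idx=0)
        # CNN over the stroke sequence captures local stroke n-gram structure
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size,
                              padding=kernel_size // 2)
        # Attention assigns one weight per position, replacing mean/max pooling
        self.attn = nn.Linear(emb_dim, 1)

    def forward(self, strokes):                   # strokes: (batch, seq_len) ids
        x = self.stroke_emb(strokes)              # (batch, seq_len, emb_dim)
        h = F.relu(self.conv(x.transpose(1, 2)))  # (batch, emb_dim, seq_len)
        h = h.transpose(1, 2)                     # (batch, seq_len, emb_dim)
        mask = (strokes == 0)                     # True at padding positions
        scores = self.attn(h).squeeze(-1).masked_fill(mask, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)     # attention weights
        return (alpha.unsqueeze(-1) * h).sum(1)   # (batch, emb_dim) embedding
```

In a CBOW-style objective, the stroke-derived embeddings of the context characters would be averaged and trained to predict the center character, for example with negative sampling as in Word2Vec.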
[1] Bengio Y, Ducharme R, Vincent P, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3: 1137-1155.
[2] Mnih A, Hinton G. A scalable hierarchical distributed language model[C]//Proceedings of the 21st International Conference on Neural Information Processing Systems. Curran Associates Inc., 2008: 1081-1088.
[3] Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model[C]//Proceedings of INTERSPEECH 2010, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, 2010: 1045-1048.
[4] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781, 2013.
[5] Mikolov T. Statistical language models based on neural networks[D]. Brno, Czech Republic: Brno University of Technology, 2012.
[6] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[J]. arXiv preprint arXiv:1310.4546, 2013.
[7] Levy O, Goldberg Y. Neural word embedding as implicit matrix factorization[J]. Advances in Neural Information Processing Systems, 2014: 2177-2185.
[8] Ji S H, Yun H, Yanardag P, et al. WordRank: Learning word embeddings via robust ranking[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016: 658-668.
[9] Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014: 1532-1543.
[10] Bojanowski P, Grave E, Joulin A, et al. Enriching word vectors with subword information[J]. Transactions of the Association for Computational Linguistics, 2017, 5: 135-146.
[11] Pinter Y, Guthrie R, Eisenstein J. Mimicking word embeddings using subword RNNs[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017: 102-112.
[12] Chen X, Xu L, Liu Z, et al. Joint learning of character and word embeddings[C]//Proceedings of the 24th International Conference on Artificial Intelligence. AAAI Press, 2015: 1236-1242.
[13] Sun Y, Lin L, Yang N, et al. Radical-enhanced Chinese character embedding[J]. arXiv preprint arXiv:1404.4714, 2014.
[14] Cao S, Lu W, Zhou J, et al. cw2vec: Learning Chinese word embeddings with stroke n-gram information[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
[15] Cao S, Lu W. Improving word embeddings with convolutional feature learning and subword information[C]//Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017.
[16] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[17] Gehring J, Auli M, Grangier D, et al. Convolutional sequence to sequence learning[J]. arXiv preprint arXiv:1705.03122, 2017.
[18] Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF models for sequence tagging[J]. arXiv preprint arXiv:1508.01991, 2015.