|
|
Embedding for Words and Word Senses Based on Human Annotated #br# Knowledge Base: A Case Study on HowNet |
SUN Maosong 1,2; CHEN Xinxiong 1 |
1. State Key Lab. of Intelligent Technology and Systems, National Lab. on Information Science and Technology,
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
2. Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing 100048, China |
|
|
Abstract This paper aims to address the necessity and effectiveness of encoding a human annotated knowledge base into a neural network language model, using HowNet as a case study. Traditional word embedding is derived from neural network language model trained on a large-scale unlabeled text corpus, which suffers from the quality of resulting vectors of low frequent words is not satisfactory, and the sense vectors of polysemous words are not available. We propose neural network language models that can systematically learn embedding for all the semantic primitives defined in HowNet, and consequently, obtain word vectors, in particular for low frequent words, and word sense vectors in terms of the semantic primitive vectors. Preliminary experimental results show that our models can improve the performance in tasks of both word similarity and word sense disambiguation. It is suggested that the research on neural network language models incorporating human annotated knowledge bases would be a critical issue deserving our attention in the coming years.
|
Received: 27 September 2016
|
|
|
|
|
[1] 董强,董振东.《知网》[DB]. http://www.keenage.com.
[2] Wang Y, Liu Z, Sun M. Incorporating Linguistic Knowledge for Learning Distributed Word Representations[J]. PloS one, 2015. 10(4): e0118437.
[3] Chen X, Liu Z, Sun M. A Unified Model for Word Sense Representation and Disambiguation[C]//Proceedings of EMNLP. 2014: 1025-1035.[4] Rothe S, Schütze H. AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes[C]//Proceedings of ACL. 2015: 1793-1803.
[5] 唐共波, 于东荀, 恩东. 基于知网义原词向量表示的无监督词义消歧方法[J]. 中文信息学报, 2015,29(6): 23-29.
[6] Mikolov T, Yih W, Zweig G. Linguistic Regularities in Continuous Space Word Representations[C]//Proceedings of HLT-NAACL. 2013: 746-751.
[7] Pennington J, Socher R, Manning C D. Glove: Global vectors for word representation[C]//Proceedings of EMNLP. 2014: 1532-1543
[8] Li W, McCallum A. Semi-supervised Sequence Modeling with Syntactic Topic Models[C]//Proceedings of AAAI. 2005: 813.
[9] Wang J, Liu J, Zhang P. Chinese Word Sense Disambiguation with PageRank and HowNet[C]//Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing, 2008. |
|
|
|