Embedding for Words and Word Senses Based on a Human-Annotated Knowledge Base: A Case Study on HowNet

SUN Maosong; CHEN Xinxiong

Journal of Chinese Information Processing ›› 2016, Vol. 30 ›› Issue (6): 1-6.
Review


Abstract

This paper addresses the necessity and effectiveness of encoding a human-annotated knowledge base into a neural network language model, using HowNet as a case study. Word embeddings are conventionally obtained by training a neural network language model on a large-scale unlabeled corpus, a framework that faces two problems: the quality of the resulting vectors for low-frequency words is hard to guarantee, and sense vectors for polysemous words are not available at all. We propose neural network language models that exploit both HowNet and a large-scale corpus to learn an embedding for every semantic primitive defined in HowNet; with these primitive vectors as a bridge, the models then automatically derive word sense vectors and improved word vectors, in particular for low-frequency words. Preliminary experimental results show that our models improve performance on both word similarity and word sense disambiguation, benefiting the treatment of low-frequency and polysemous words. We suggest that neural network language models incorporating human-annotated knowledge bases should be a key research focus of natural language processing in the coming years.
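The composition idea sketched in the abstract — sense vectors derived from semantic-primitive vectors, and word vectors for low-frequency polysemous words backed off to sense vectors — can be illustrated as follows. This is a minimal sketch under stated assumptions (randomly initialized primitive vectors, plain averaging as the composition, and a made-up primitive annotation for "apple"), not the paper's actual model, which learns the primitive vectors jointly from HowNet and a large corpus:

```python
import numpy as np

# Hypothetical semantic-primitive inventory and embeddings; the paper learns
# these jointly from HowNet and a corpus, here they are random placeholders.
rng = np.random.default_rng(0)
dim = 50
primitive_vecs = {p: rng.normal(size=dim)
                  for p in ["fruit", "computer", "able", "bring"]}

def sense_vector(primitives):
    # Compose a sense vector from the vectors of its annotated primitives
    # (averaging is an illustrative choice, not necessarily the paper's).
    return np.mean([primitive_vecs[p] for p in primitives], axis=0)

# Two hypothetical HowNet senses of "apple", each annotated with primitives:
apple_fruit = sense_vector(["fruit"])
apple_device = sense_vector(["computer", "able", "bring"])

# A word vector for a low-frequency polysemous word can then be backed off
# to the average of its sense vectors instead of sparse corpus statistics.
apple_word = np.mean([apple_fruit, apple_device], axis=0)
print(apple_word.shape)  # (50,)
```

Because every primitive appears in the annotations of many words, even a word seen rarely (or never) in the corpus still receives a vector grounded in well-estimated primitive vectors.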


Key words

word embedding / word sense embedding / semantic primitive embedding / HowNet / neural network language model
 

Cite this article

SUN Maosong; CHEN Xinxiong. Embedding for Words and Word Senses Based on a Human-Annotated Knowledge Base: A Case Study on HowNet. Journal of Chinese Information Processing. 2016, 30(6): 1-6.


Funding

National Social Science Foundation of China (13&ZD190); National Natural Science Foundation of China (61133012)