ZHANG Kejun1, SHI Taimeng1, LI Weinan1,2, QIAN Rong1

Abstract

Starting from the language models used to train word vectors, this paper examines the strengths and weaknesses of word vectors trained with the classic skip-gram and CBOW models, introduces the TFIDF text-keyword weighting scheme, and proposes a keyword-enhanced language model. The study finds that the classic skip-gram and CBOW models capture only the relation between a word and its local context, whereas the improved model uses text keywords to link each word to the document as a whole; in terms of the precision and similarity of the trained word vectors, the improved model yields a modest gain over skip-gram and CBOW. Comparative training experiments on a 1.5 GB Chinese Wikipedia corpus show that word vectors trained with the CBOW-TFIDF model perform best on the similar-word test task. When the improved word vectors are applied to sentiment-orientation analysis, the precision and F1 score for positive reviews rise by 4.79% and 4.92%, respectively. Word vectors improved on the basis of statistical language models are therefore of considerable practical value for sentiment-orientation analysis and other applications built on word vectors.
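The abstract says the improvement uses TFIDF to extract text keywords that tie each word to the whole document, but gives no implementation details. As a minimal illustrative sketch only (the function name `tfidf_keywords`, the scoring formula, and the toy corpus are assumptions, not taken from the paper), per-document TF-IDF keyword ranking can be computed like this:

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=2):
    """Rank each document's terms by TF-IDF and return the top_k per document.

    docs: list of token lists (Chinese text would be segmented first,
    e.g. with a word-segmentation tool).
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # tf-idf = (term count / doc length) * log(N / document frequency)
        scores = {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords.append(ranked[:top_k])
    return keywords

docs = [
    ["vector", "training", "language", "model", "vector"],
    ["sentiment", "analysis", "language", "model"],
    ["keyword", "extraction", "tfidf", "training"],
]
print(tfidf_keywords(docs))
```

Terms that appear in many documents (here "language", "model", "training") receive low IDF and are ranked below document-specific terms, which is the property that lets TF-IDF keywords characterize a document as a whole.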
Key words
word vector /
statistical language model /
TFIDF /
text keywords /
CBOW-TFIDF
Funding

National Key R&D Program of China (2018YFB1004101); National Natural Science Foundation of China (61170037)