A Word-Vector Construction Optimization Algorithm for Short Review Texts Based on Attribute-Topic Partition

Journal of Chinese Information Processing, 2016, Vol. 30, Issue 5: 101-110.

Survey

李志宇,梁循,周小平

Improving the Word2vec on Short Text by Topic: Partition

LI Zhiyu, LIANG Xun, ZHOU Xiaopin

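Abstract (translated from the Chinese)

Starting from the training mode of word vectors, this paper compares the quality of word vectors trained under three partition schemes: the corpus-sentence partition (BWP) algorithm, the separator partition (BSP) algorithm, and the attribute-topic partition (BTP) algorithm. The study finds that, owing to the characteristics of short review texts, the traditional no-partition (NP) training method differs markedly from BWP, BSP, and BTP in the accuracy and similarity of the resulting word vectors. Comparative word-vector construction experiments on 70 million short review texts show that the proposed BTP algorithm achieves the best results on the synonym (attribute word) test task. BTP is therefore of considerable practical value for word-vector-based applied research, including optimizing word-vector training on short review texts, extracting attribute words from short review texts, and analyzing sentiment orientation. In addition, the word vectors built in this paper on a very large review corpus (open-sourced) are readily usable for other product-review text analysis tasks.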

Abstract

We propose a method for Word2vec training on short review texts that partitions the corpus by topic. We examine three partition methods, namely Based on Whole-review (BWP), Based on sentence-Separator (BSP), and Based on Topic (BTP), to improve the results of Word2vec training. Our findings suggest that, owing to the characteristics of short review texts, accuracy and similarity rates differ considerably between the None Partition model (NP) and BWP, BSP, and BTP. Experiments on various models and vector dimensions demonstrate that BTP greatly improves the word vectors trained by the Word2vec model.
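
The partition schemes named in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out how each scheme works, so everything below is an assumption made for illustration: the toy reviews, the separator set, the `TOPIC_CUES` lexicon, and the rule that a clause without a cue word inherits the topic of the preceding clause are all hypothetical, and the function names are not taken from the paper. The sketch only shows how the same reviews might yield different training units (the "sentences" a Word2vec trainer would slide its context window over) under each scheme.

```python
import re

# Hypothetical toy reviews, pre-tokenized with spaces (an assumption).
REVIEWS = [
    "screen is bright , colors look vivid . battery lasts long",
    "battery drains quickly ; screen looks sharp",
]

# Hypothetical attribute lexicon: topic name -> cue words (an assumption).
TOPIC_CUES = {
    "screen": {"screen", "display"},
    "battery": {"battery", "charge"},
    "delivery": {"delivery", "shipping"},
}

# Assumed set of separator punctuation marks.
SEPARATORS = re.compile(r"[,.;!?]")


def words(text):
    """Tokens of a text with separator tokens dropped."""
    return [t for t in text.split() if not SEPARATORS.fullmatch(t)]


def no_partition(reviews):
    """NP: the whole corpus forms one continuous training unit."""
    return [[t for review in reviews for t in words(review)]]


def whole_review_partition(reviews):
    """BWP (as assumed here): each review is one training unit."""
    return [words(review) for review in reviews]


def separator_partition(reviews):
    """BSP (as assumed here): each separator-delimited clause is one unit."""
    units = []
    for review in reviews:
        for clause in SEPARATORS.split(review):
            tokens = clause.split()
            if tokens:
                units.append(tokens)
    return units


def topic_of(tokens, cues):
    """Name of the first topic whose cue words occur in tokens, else None."""
    for topic, cue_words in cues.items():
        if cue_words & set(tokens):
            return topic
    return None


def topic_partition(reviews, cues=TOPIC_CUES):
    """BTP (as assumed here): adjacent clauses on the same topic are merged."""
    units = []
    for review in reviews:
        current_topic, current = None, []
        for clause in separator_partition([review]):
            # A clause without a cue word inherits the running topic.
            topic = topic_of(clause, cues) or current_topic
            if current and topic != current_topic:
                units.append(current)
                current = []
            current_topic = topic
            current.extend(clause)
        if current:
            units.append(current)
    return units


if __name__ == "__main__":
    schemes = (no_partition, whole_review_partition,
               separator_partition, topic_partition)
    for scheme in schemes:
        print(scheme.__name__, len(scheme(REVIEWS)))
```

Each function returns a list of token lists, which is the shape most Word2vec implementations accept as training input. The design point is that the unit boundaries decide which words can ever share a context window, which is one plausible reason the choice of partition would change the trained vectors.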

Cite this article

LI Zhiyu, LIANG Xun, ZHOU Xiaopin. Improving the Word2vec on Short Text by Topic: Partition. Journal of Chinese Information Processing. 2016, 30(5): 101-110

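Funding

National Natural Science Foundation of China (71531012, 71271211); JD.com E-commerce Research Project (413313012); Beijing Natural Science Foundation (4132067); Renmin University of China Brand Program (10XNI029); Renmin University of China 2015 Outstanding Innovative Talents Cultivation Funding Program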

Received: 2015-06-03
Published: 2016-10-15
Issue date: 2016-10-15
