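An entity (a product or a merchant) is often accompanied by thousands of user reviews. How to extract the essential information that describes the entity from such voluminous reviews is an active research question. This paper proposes a method that extracts feature tags for each entity and removes semantic duplicates, so that the tags are mutually independent in the semantic space. First, all reviews of each entity undergo Chinese word segmentation, part-of-speech tagging, and dependency parsing. Then, key tags are extracted according to the dependency relations in each sentence to form the entity's tag library, and the library is explicitly deduplicated at the semantic level. Finally, K-Means clustering and the Latent Dirichlet Allocation (LDA) topic model map each tag into a space of semantically independent topics, and the tags are ranked by their confidence with respect to their topic. Together, these steps yield semantically independent key tags that describe each entity. In the experiments, a statistical analysis of the accuracy and semantic diversity of the returned tag lists verifies the feasibility and effectiveness of the tag extraction method.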
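The explicit deduplication step can be illustrated with a minimal sketch. It assumes that surface similarity is measured by Levenshtein (edit) distance and that a tag is dropped when it is close enough to a tag already kept; the abstract does not state the actual measure or threshold, so the function names and the `max_ratio` value below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def dedupe_tags(tags, max_ratio=0.34):
    """Keep a tag only if it is not too close to a tag that was already kept.

    max_ratio is an assumed threshold on (edit distance / longer tag length);
    it is not a value taken from the paper.
    """
    kept = []
    for tag in tags:
        is_duplicate = any(
            edit_distance(tag, k) / max(len(tag), len(k), 1) <= max_ratio
            for k in kept
        )
        if not is_duplicate:
            kept.append(tag)
    return kept
```

For example, `dedupe_tags(["good service", "good services", "cheap price"])` keeps the first and third tags, because the second differs from the first by a single character. The same functions work on Chinese strings, since Python compares them character by character.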
Abstract
An entity (e.g., a shop or a product) usually has thousands of comments. Extracting concise and useful information that describes the entity is a challenging issue. This paper proposes a method to extract tags without semantic redundancy. First, we perform word segmentation, POS tagging, and dependency parsing on all the comments. Then, we extract tags according to the dependency relations and explicitly remove semantically duplicate tags. Finally, we map all the tags to an independent semantic space via K-Means and Latent Dirichlet Allocation (LDA), and rank the tag list according to topic confidence. The experimental results show that our method extracts tags accurately and with semantic independence.
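The final ranking step can be sketched as follows. The sketch assumes that each tag has already been assigned a topic and a confidence score by the clustering and topic-model stage, and that diversity is obtained by taking the strongest tag from each topic in turn. The abstract only says that tags are ranked by topic confidence, so the round-robin policy and all names here are illustrative assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class ScoredTag(NamedTuple):
    text: str
    topic: int         # index of the topic the tag was mapped to (assumed given)
    confidence: float  # confidence of the tag with respect to that topic


def rank_diverse(tags, top_n):
    """Return up to top_n tag texts, visiting topics in round-robin order."""
    by_topic = defaultdict(list)
    for tag in tags:
        by_topic[tag.topic].append(tag)

    # Within each topic, the most confident tag comes first.
    for group in by_topic.values():
        group.sort(key=lambda t: t.confidence, reverse=True)

    # Topics are visited in order of their strongest tag.
    groups = sorted(by_topic.values(),
                    key=lambda g: g[0].confidence, reverse=True)

    ranked = []
    depth = 0  # 0 = best tag of each topic, 1 = second best, and so on
    while len(ranked) < top_n and any(depth < len(g) for g in groups):
        for group in groups:
            if depth < len(group) and len(ranked) < top_n:
                ranked.append(group[depth].text)
        depth += 1
    return ranked
```

With this policy, a second tag from the same topic appears only after every topic has contributed its best tag, which keeps the head of the list semantically varied.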
Key words: opinion mining; topic model; semantic independence; tag extraction; ranking
Keywords
opinion mining (意见挖掘) /
topic model (主题模型) /
semantic independence (语义独立) /
tag extraction (标签抽取) /
ranking (排序)
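Funding
Supported by the National Natural Science Foundation of China (60970047, 61103151, 61173068) and the Doctoral Program Fund of the Ministry of Education (20110131110028).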