EntropyRank: Keyphrase Extraction Algorithm Based on Topic Entropy

YIN Hong, CHEN Yan, LI Ping

Journal of Chinese Information Processing, 2019, Vol. 33, Issue (11): 107-114.
Information Extraction and Text Mining


Abstract

Keyphrase extraction is an important subtask of natural language processing that aims to automatically identify the important phrases in a document. Existing methods mainly rely on two factors that affect extraction quality: the relations between words and the influence of each individual word. Because keyphrases should accurately represent a document's topics, this paper proposes EntropyRank, a keyphrase extraction algorithm based on topic entropy. The algorithm first uses Latent Dirichlet Allocation (LDA) to learn the topic distributions of documents and of words, and combines the two to obtain each word's topic distribution within a specific document. It then computes the information entropy of that distribution, called the topic entropy, to represent the word's influence. Finally, it runs a random walk on the word co-occurrence graph to score each candidate phrase. Experiments on six public datasets show that the algorithm improves F1 by 2.61% to 6.98% over existing unsupervised keyphrase extraction algorithms.
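The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so several choices here are assumptions rather than the authors' method: the word topic distribution in a document is taken as proportional to P(w|z) * P(z|d), a word's influence is taken as one minus its normalized entropy (so topic-focused words score higher), the influence is used as the restart distribution of a PageRank-style walk, and a phrase is scored by summing its word scores. All function names and the toy numbers are hypothetical, and the LDA outputs (`phi`, `theta`) are hard-coded instead of trained.

```python
import math
from collections import defaultdict


def word_topic_in_doc(p_w_given_z, p_z_given_d):
    """Assumed combination: P(z | w, d) proportional to P(w | z) * P(z | d)."""
    joint = [pw * pz for pw, pz in zip(p_w_given_z, p_z_given_d)]
    total = sum(joint)
    return [j / total for j in joint]


def topic_entropy(dist):
    """Shannon entropy (natural log) of a topic distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def influence(dist):
    """Assumed mapping: 1 - H / log(K); requires K >= 2 topics."""
    return 1.0 - topic_entropy(dist) / math.log(len(dist))


def cooccurrence_graph(tokens, window=2):
    """Undirected graph; edge weight counts co-occurrences within the window.

    window=2 links each token only to the token immediately after it.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, u in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if u != v:
                graph[u][v] += 1.0
                graph[v][u] += 1.0
    return {u: dict(nbrs) for u, nbrs in graph.items()}


def biased_random_walk(graph, restart, damping=0.85, iters=100):
    """PageRank-style iteration whose restart distribution is `restart`."""
    total = sum(restart[w] for w in graph)
    prior = {w: restart[w] / total for w in graph}
    out_weight = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    score = dict(prior)
    for _ in range(iters):
        score = {
            w: (1 - damping) * prior[w]
            + damping * sum(score[u] * graph[u][w] / out_weight[u] for u in graph[w])
            for w in graph
        }
    return score


def phrase_score(phrase_tokens, word_scores):
    """Assumed aggregation: a candidate phrase scores the sum of its words."""
    return sum(word_scores[w] for w in phrase_tokens)


# Toy, hand-made stand-ins for trained LDA output with K = 2 topics.
theta = [0.7, 0.3]  # P(z | d) for the document
phi = {             # P(w | z) for each word
    "topic": [0.30, 0.05],
    "entropy": [0.10, 0.10],
    "keyphrase": [0.25, 0.02],
    "extraction": [0.20, 0.05],
    "model": [0.05, 0.30],
}
tokens = ["topic", "entropy", "keyphrase", "extraction",
          "topic", "model", "keyphrase", "extraction"]

graph = cooccurrence_graph(tokens)
# A small constant keeps the restart mass positive even for uniform words.
restart = {w: influence(word_topic_in_doc(phi[w], theta)) + 1e-9 for w in graph}
scores = biased_random_walk(graph, restart)

for phrase in (["keyphrase", "extraction"], ["topic", "model"]):
    print(" ".join(phrase), round(phrase_score(phrase, scores), 4))
```

In practice `phi` and `theta` would come from an LDA model trained on a corpus, and candidate phrases would be selected by a part-of-speech filter or similar; both steps are omitted here to keep the sketch short.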

Keywords

keyphrase extraction / random walk / topic model / word influence

Cite this article

YIN Hong, CHEN Yan, LI Ping. EntropyRank: Keyphrase Extraction Algorithm Based on Topic Entropy. Journal of Chinese Information Processing. 2019, 33(11): 107-114


Funding

National Natural Science Foundation of China, Young Scientists Fund (61503312)