Keyword extraction is a key research problem in natural language processing, knowledge graphs, dialogue systems, and related fields. In this paper, we analyze the keyword extraction process underlying existing algorithms and review in detail the computational features and application cases of existing keyword extraction methods. We examine supervised, unsupervised, and semi-supervised extraction methods in terms of feature extraction, representative papers, model algorithms, and method descriptions, and summarize their research progress, algorithmic mechanisms, advantages, limitations, and application scenarios. We also present evaluation strategies for keyword extraction, discuss the application prospects of semi-supervised keyword extraction, and outline research directions and possible challenges in feature fusion, domain knowledge, and graph construction.
CUI Hongzhen, ZHANG Longhao, PENG Yunfeng, WU Wen.
A Survey for Keyword Extraction Algorithms. Journal of Chinese Information Processing. 2024, 38(2): 1-14,24