Research on the Chinese Keyword Extraction Algorithm Based on Separate Models

LUO Zhun-chen, WANG Ting

Journal of Chinese Information Processing, 2009, Vol. 23, Issue (1): 63.
Review


Abstract

Keyword extraction plays an important role in automatic summarization, information retrieval, text classification, and text clustering. A considerable share of what are commonly called keywords are in fact key phrases and out-of-vocabulary words, and extracting these is especially difficult. This paper proposes treating keyword extraction as two separate problems, key word extraction and key word-sequence (keyphrase) extraction, and designs a Chinese keyword extraction algorithm based on separate models. Different features are designed for each of the two problems to improve extraction accuracy. Experiments show that the separate-model algorithm performs better than traditional keyword extraction algorithms.
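
The abstract does not spell out the features used by either model, although the keyword list names mutual information. As a rough illustration only, the sketch below assumes the text has already been segmented into a token list, computes pointwise mutual information for adjacent word pairs, and routes candidates into two pools (single words and word sequences) that could then be scored by separate models. The function names, the threshold, and the two-word limit are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def adjacent_pmi(tokens):
    """Pointwise mutual information (base 2) for each adjacent token pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # avoid division by zero on 1-token input
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def split_candidates(tokens, pmi_threshold=1.0):
    """Return (single-word candidates, word-sequence candidates).

    pmi_threshold is an illustrative cutoff, not a value from the paper.
    """
    pmi = adjacent_pmi(tokens)
    sequences = sorted(pair for pair, s in pmi.items() if s >= pmi_threshold)
    words = sorted(set(tokens))
    return words, sequences


if __name__ == "__main__":
    # Toy pre-segmented input; real input would come from a Chinese segmenter.
    toy = ["keyword", "extraction", "keyword", "extraction",
           "keyword", "model", "extraction", "model"]
    words, sequences = split_candidates(toy)
    print(words)
    print(sequences)
```

Keeping the two pools apart is what would let each one be scored with its own feature set, which is the idea the abstract attributes to the separate-model design.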

Key words

computer application / Chinese information processing / keyword extraction / keyphrases / separate model / mutual information / word-sequence boundary parameter table

Cite this article

LUO Zhun-chen, WANG Ting. Research on the Chinese Keyword Extraction Algorithm Based on Separate Models. Journal of Chinese Information Processing. 2009, 23(1): 63


Funding

National Natural Science Foundation of China (60403050); Program for New Century Excellent Talents in University (NCET-06-0926)