Resume Information Extraction Based on Cascaded Double-layer Classification

YU Kun, GUAN Gang, ZHOU Ming, WANG Xu-fa, CAI Qing-sheng

Journal of Chinese Information Processing, 2006, Vol. 20, Issue 1: 61-68.


Abstract

This paper presents an approach to automatic resume information extraction based on cascaded double-layer text classification. The approach decomposes resume text into blocks and strings, and decomposes the target information into general information and detailed information. It first segments the resume into blocks and classifies them to extract the general information. It then uses a fuzzy strategy to select the blocks that may contain predefined detailed information, segments those blocks into strings, and labels each string with a detailed-information class. Experimental results on 1,200 Chinese resumes show that the approach is suitable for the automatic extraction and management of resume information.
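The two-layer cascade described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the classifiers, the class inventory, or the fuzzy selection rule, so the cue-word scorer, the regular expressions, the blank-line block segmentation, the `threshold` parameter, and all names below are assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical block classes and cue words (the source does not list its classes).
BLOCK_CUES: Dict[str, Tuple[str, ...]] = {
    "personal": ("name", "gender", "phone", "email"),
    "education": ("education", "university", "degree"),
    "experience": ("experience", "company", "position"),
}
# Hypothetical: block classes assumed to carry detailed information.
DETAIL_BLOCKS = ("personal", "education")
# Hypothetical string classes; patterns are tried in this order.
STRING_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "period": re.compile(r"\b(?:19|20)\d{2}\s*-\s*(?:19|20)\d{2}\b"),
    "phone": re.compile(r"\+?\d[\d -]{6,}\d"),
    "degree": re.compile(r"\b(?:Bachelor|Master|PhD)\b"),
}

def segment_blocks(resume: str) -> List[str]:
    """Layer 1 segmentation: assume blank lines separate blocks."""
    return [b.strip() for b in re.split(r"\n\s*\n", resume) if b.strip()]

def score_block(block: str) -> Dict[str, float]:
    """Stand-in block classifier: share of each class's cue words found in the block."""
    text = block.lower()
    return {label: sum(cue in text for cue in cues) / len(cues)
            for label, cues in BLOCK_CUES.items()}

def segment_strings(block: str) -> List[str]:
    """Layer 2 segmentation: split a block on newlines, commas, and semicolons."""
    return [s.strip() for s in re.split(r"[\n,;]", block) if s.strip()]

def extract(resume: str, threshold: float = 0.25):
    """Run both layers; return (general, detailed) lists of (label, text) pairs."""
    general: List[Tuple[str, str]] = []
    detailed: List[Tuple[str, str]] = []
    for block in segment_blocks(resume):
        scores = score_block(block)
        best = max(scores, key=scores.get)
        general.append((best if scores[best] > 0 else "other", block))
        # Assumed "fuzzy" selection: keep any block whose score for a
        # detail-bearing class clears a loose threshold, even if that
        # class is not the block's top-scoring label.
        if any(scores[label] >= threshold for label in DETAIL_BLOCKS):
            for string in segment_strings(block):
                for label, pattern in STRING_PATTERNS.items():
                    match = pattern.search(string)
                    if match:
                        detailed.append((label, match.group(0)))
                        break
    return general, detailed

SAMPLE = """Name: Li Ming
Phone: 010-5555-0100, Email: li.ming@example.com

Education
2001-2005, Example University, Bachelor of Computer Science

Hobbies: reading and hiking"""

general, detailed = extract(SAMPLE)
for label, text in detailed:
    print(label, "->", text)
```

In this sketch the first layer assigns one label per block, while the second layer only runs on blocks that pass the loose selection step, so an error in the top-1 block label does not necessarily hide the block from detailed extraction. A trained classifier could replace `score_block` and the regular expressions without changing the cascade structure.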

Keywords

computer application / Chinese information processing / information extraction / text classification / resume management

Cite this article

YU Kun, GUAN Gang, ZHOU Ming, WANG Xu-fa, CAI Qing-sheng. Resume Information Extraction Based on Cascaded Double-layer Classification. Journal of Chinese Information Processing. 2006, 20(1): 61-68
