Research and Development of Biomedical Text Mining

WANG Hao-chang, ZHAO Tie-jun

Journal of Chinese Information Processing, 2008, Vol. 22, Issue (3): 89-98.
Survey

Abstract

Biomedical research is one of the most closely watched research fields of the 21st century, and it produces an enormous volume of literature: more than 600,000 papers are published per year on average. The challenge facing researchers in the field is how to acquire relevant knowledge effectively from a literature of this scale. Biomedical text mining, a branch of bioinformatics, is a new approach to acquiring such knowledge efficiently and automatically, and it has made considerable progress in recent years. This survey introduces the main research methods and results in biomedical text mining: machine-learning-based biomedical named entity recognition, abbreviation and synonym recognition, and named entity relation extraction, together with related resource construction, evaluation campaigns, and academic conferences. It also briefly describes the state of research in China and closes with an outlook on near-term developments in the field.

Key words

computer application / Chinese information processing / bioinformatics / text mining / information extraction / machine learning

Cite this article

WANG Hao-chang, ZHAO Tie-jun. Research and Development of Biomedical Text Mining. Journal of Chinese Information Processing. 2008, 22(3): 89-98

Funding

National 863 Program of China (2006AA010108, 2006AA01Z150)