Protein-Protein Interaction Extraction from Biomedical Literature
ZHAO Zhehuan1, YANG Zhihao2, SUN Cong2, LIN Hongfei2
1. Key Laboratory for Ubiquitous Network and Service Software of Liaoning, School of Software, Dalian University of Technology, Dalian, Liaoning 116620, China; 2. College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
Abstract: Protein-protein interaction (PPI) extraction has broad application value across the life sciences. However, machine learning based PPI extraction methods generally stop at binary relation extraction and so lose the rich relation-type information, while rule-based open information extraction methods can extract complete protein relations ("protein1, relation word, protein2") but suffer from low recall. To address these problems, this paper proposes a PPI extraction framework that combines machine learning and rule-based methods. The framework first applies machine learning methods to perform named entity recognition and binary relation extraction, and then uses syntactic patterns and dictionary matching to extract the relation word that expresses the type of relationship between the two proteins. The method achieves an F-score of 40.18% on the AImed corpus, far higher than the rule-based Stanford Open IE method.
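The second stage described in the abstract, finding the relation word for a protein pair that has already been judged as interacting, can be illustrated with a minimal sketch. The dictionary entries, the function name extract_relation_word, and the token-position heuristic are all illustrative assumptions; in particular, the heuristic stands in for the paper's syntactic patterns, which are not specified here, so this is not the authors' implementation.

```python
import re

# Hypothetical relation-word dictionary (illustrative entries only).
RELATION_WORDS = {"binds", "bind", "interacts", "interact", "activates",
                  "inhibits", "phosphorylates", "regulates"}


def extract_relation_word(sentence, protein1, protein2):
    """Return (protein1, relation_word, protein2), or None if nothing matches.

    Assumes an upstream classifier already marked the pair as interacting.
    Stand-in heuristic (an assumption, not the paper's patterns): prefer a
    dictionary word located between the two mentions; otherwise take the
    dictionary word closest to either mention.
    """
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    lowered = [t.lower() for t in tokens]
    try:
        i = lowered.index(protein1.lower())
        j = lowered.index(protein2.lower())
    except ValueError:
        return None  # a mention was not found as a single token
    lo, hi = min(i, j), max(i, j)
    # Positions of all tokens that appear in the relation-word dictionary.
    candidates = [k for k, t in enumerate(lowered) if t in RELATION_WORDS]
    if not candidates:
        return None
    between = [k for k in candidates if lo < k < hi]
    if between:
        best = between[0]
    else:
        best = min(candidates, key=lambda k: min(abs(k - lo), abs(k - hi)))
    return (protein1, tokens[best], protein2)
```

For example, calling extract_relation_word("RAD51 interacts with BRCA2 in vivo.", "RAD51", "BRCA2") yields the triple ("RAD51", "interacts", "BRCA2"). Keeping the dictionary lookup separate from the pair classifier mirrors the two-stage design in the abstract: the classifier decides whether a pair interacts, and the rule-based step only has to name the relation.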