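Protein complexes are important for helping biologists understand cellular organization and function, and identifying complexes from protein-protein interaction (PPI) networks by computational methods is an active research topic. However, PPI networks contain a large amount of false-negative and false-positive noise, and the set of known protein complexes is incomplete, so overcoming the noise in PPI networks and making better use of known protein complexes are key problems that protein complex identification urgently needs to solve. To this end, this paper proposes NOBEL, an algorithm that identifies protein complexes by supervised learning based on the topological information of protein complexes. First, NOBEL constructs a weighted PPI network from the biological and topological information of proteins, which reduces the noise in the network. Next, it extracts the topological information of complexes from the weighted and unweighted PPI networks as features and trains a supervised learning model on these features, so that the model can effectively learn the information contained in the complexes. Finally, the trained model is applied to the PPI network to identify protein complexes. The authors ran experiments on four real PPI networks. The results show that, compared with seven other protein complex identification algorithms, NOBEL improves the F-measure by at least 4.39% (Gavin), 1.32% (DIP), 2.39% (WI-PHI_core), and 2.34% (WI-PHI_extend).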
Abstract
Protein complexes are important for understanding cell organization and function, and identifying complexes from protein-protein interaction (PPI) networks by computational methods is an active research topic. To overcome the noise in PPI networks, this paper proposes NOBEL, a protein complex identification algorithm that uses supervised learning based on the topological information of protein complexes. First, NOBEL constructs a weighted PPI network from the biological and topological information of proteins, which reduces the noise in the network. Then, the topological information of complexes is extracted from the weighted and unweighted PPI networks as features for the supervised model. Finally, the trained model is applied to identify protein complexes from PPI networks. Experiments on four real PPI networks show that, compared with seven other protein complex identification algorithms, NOBEL improves the F-measure by at least 4.39% on Gavin, 1.32% on DIP, 2.39% on WI-PHI_core, and 2.34% on WI-PHI_extend.
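The feature-extraction step can be illustrated with a minimal sketch. It is not NOBEL's actual feature set, which the abstract does not list; the function name `topological_features`, the chosen features (size, density, degree statistics, weighted density), and the toy network are all assumptions made for illustration.

```python
def topological_features(nodes, edges, weights=None):
    """Return simple topological features of the subgraph induced by `nodes`.

    nodes   -- proteins in a candidate complex
    edges   -- iterable of (u, v) interactions for the whole PPI network
    weights -- optional dict mapping frozenset({u, v}) to an edge weight
    All feature choices here are illustrative assumptions.
    """
    nodes = set(nodes)
    n = len(nodes)
    # Keep only interactions with both endpoints inside the candidate complex.
    inner = {
        frozenset((u, v))
        for u, v in edges
        if u in nodes and v in nodes and u != v
    }
    possible = n * (n - 1) / 2  # number of edges in a complete graph on n nodes
    degree = {v: 0 for v in nodes}
    for edge in inner:
        for v in edge:
            degree[v] += 1
    features = {
        "size": n,
        "density": len(inner) / possible if possible else 0.0,
        "mean_degree": sum(degree.values()) / n if n else 0.0,
        "max_degree": max(degree.values(), default=0),
    }
    if weights is not None:
        # Weighted density: total internal edge weight over possible edges.
        total = sum(weights.get(edge, 0.0) for edge in inner)
        features["weighted_density"] = total / possible if possible else 0.0
    return features


# Toy PPI network (hypothetical proteins A-D and hypothetical weights).
ppi_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
ppi_weights = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("A", "C")): 0.7,
    frozenset(("C", "D")): 0.2,
}
features = topological_features(["A", "B", "C"], ppi_edges, ppi_weights)
```

Feature dictionaries of this kind, computed for known complexes and for non-complex subgraphs, could then serve as training rows for any supervised classifier.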
Key words
protein complex /
supervised learning /
feature extraction /
protein interaction network
Funding
China Postdoctoral Science Foundation (2020M680931)