Matches in SemOpenAlex for { <https://semopenalex.org/work/W4310857078> ?p ?o ?g. }
- W4310857078 endingPage "100651" @default.
- W4310857078 startingPage "100651" @default.
- W4310857078 abstract "•Integrating multiple molecular networks to improve the signal-to-noise ratio
•Self-supervised representation learning at both the node level and the context level
•Task-specific re-training using a graph attention network converges efficiently
•Achieves superior performance for disease gene reprioritization
With the recent progress of high-throughput experimental techniques, physical interactions and functional associations of genes and proteins are accumulating into multiple molecular networks. Effective integration of these networks and extraction of biological insight remain a long-standing challenge. The two-step GNN (graph neural network) approach (Graphene) introduced here offers a self-supervised solution and validates its utility on a range of disease gene sets. Leveraging molecular networks to discover disease-relevant modules is a long-standing challenge. With the accumulation of interactomes, there is a pressing need for powerful computational approaches to handle the inevitable noise and context-specific nature of biological networks. Here, we introduce Graphene, a two-step self-supervised representation learning framework tailored to concisely integrate multiple molecular networks and adapted to gene functional analysis via downstream re-training. In practice, we first leverage GNN (graph neural network) pre-training techniques to obtain initial node embeddings, followed by re-training Graphene using a graph attention architecture, achieving superior performance over competing methods for pathway gene recovery, disease gene reprioritization, and comorbidity prediction. Graphene successfully recapitulates tissue-specific gene expression across the disease spectrum and demonstrates shared heritability of common mental disorders. Graphene can be updated with new interactomes or other omics features.
Graphene holds promise to decipher gene function in its network context, refine GWAS (genome-wide association study) hits, and offer mechanistic insights by decoding diseases from genome to networks to phenotypes. Diseases or traits involve molecules interacting within cellular networks and pathways under certain biological contexts. Understanding the functional interdependencies of genes and proteins can provide a system-level view of how genetic alterations dysregulate relevant pathways or biological processes and further lead to disease phenotypes [1: Wong et al., Nat. Rev. Genet. 2021; 22: 774-790, doi:10.1038/s41576-021-00389-x]. A classical insight of network biology is that genes or proteins with similar topological neighborhood patterns are more likely to be correlated, which enables knowledge refinement for known molecules and property inference for unknown ones through the “guilt by association” principle. There has been a recent community benchmark effort to evaluate disease module identification methods on various network configurations [2: Choobdar et al., Nat. Methods 2019; 16: 843-852, doi:10.1038/s41592-019-0509-5]. Network-based methods have also been used to reprioritize statistical signals from disease-focused genome-wide association studies (GWAS). For example, the NetWAS framework [3: Greene et al., Nat. Genet. 2015; 47: 569-576, doi:10.1038/ng.3259] combines tissue-specific networks with marginally significant GWAS hits as input to a machine learning model that ranks candidate genes. The NAGA framework [4: Carlin et al., iScience 2019; 16: 155-161, doi:10.1016/j.isci.2019.05.025] harnessed a composite molecular network in a propagation approach to boost GWAS results for eight diseases. iRIGs [5: Wang et al., Nat. Neurosci. 2019; 22: 691-699, doi:10.1038/s41593-019-0382-7] reprioritized schizophrenia (SCZ) GWAS genes with a Bayesian framework that integrates multi-omics data and a protein-protein interaction (PPI) network. Buphamalai et al. [6: Nat. Commun. 2021; 12: 6306, doi:10.1038/s41467-021-26674-1] constructed a multiplex network organized into hierarchical layers spanning different omics levels and, using propagation-based algorithms [7: Cowen et al., Nat. Rev. Genet. 2017; 18: 551-562, doi:10.1038/nrg.2017.38], revealed that rare diseases also exhibit network signatures similar to those of complex diseases. A comprehensive review of network-based disease gene prioritization [8: Ata et al., Briefings Bioinf. 2021; 22: bbaa303, doi:10.1093/bib/bbaa303] categorizes existing computational efforts into three major classes: network diffusion methods, traditional machine learning methods with handcrafted features, and graph representation learning methods. Notably, Set2Gaussian [9: Wang et al., Nat. Mach. Intell. 2020; 2: 387-395, doi:10.1038/s42256-020-0193-2] embeds gene sets as multivariate Gaussian distributions in a low-dimensional space based on genes’ proximity in the PPI network, manifesting stronger expressive power than traditional network diffusion methods. The utility of these network methods relies strongly on the quality and coverage of available molecular networks. Recent advances in high-throughput experimental platforms and computational techniques have enabled the characterization of heterogeneous genome-scale networks, including physical interactions (for example, PPI [10: Szklarczyk et al., Nucleic Acids Res. 2015; 43: D447-D452, doi:10.1093/nar/gku1003], signaling, and regulatory networks) and functional associations (for example, gene co-expression, genetic dependencies, co-evolution, and phylogenetic patterns). Huang et al. [11: Cell Syst. 2018; 6: 484-495.e5, doi:10.1016/j.cels.2018.03.001] systematically evaluated 21 human interaction networks covering various types of interactions, concluding that ConsensusPathDB [12: Kamburov et al., Nucleic Acids Res. 2009; 37: D623-D628, doi:10.1093/nar/gkn698], GIANT [13: Wong et al., Nucleic Acids Res. 2018; 46: W65-W70, doi:10.1093/nar/gky408] (now available as HumanBase), and STRING [10] perform best at recovering disease gene sets; the benefit of a larger network as a whole outweighs the drawback of potential false positives, and recurrent but nuanced signals can be amplified. Picart-Armada et al. [14: PLoS Comput. Biol. 2019; 15: e1007276, doi:10.1371/journal.pcbi.1007276] also emphasized the merit of larger networks. The ever-growing repositories of interactomes require methods that combine these networks while tackling their inherent noise and incompleteness. Huang et al. pioneered a parsimonious composite network (PCNet) [11] with high efficiency. Mashup [15: Cho et al., Cell Syst. 2016; 3: 540-548.e5, doi:10.1016/j.cels.2016.10.017] runs random walks with restart (RWR) [16: Tong et al., ICDM 2006: 613-622] on each network, then optimizes a consistent dimension-reduction function to derive a compact integration as low-dimensional vectors for each gene or protein that can be plugged into downstream functional tasks. Several other methods have been proposed to integrate multiple networks. Gao et al. [17: IEEE Trans. Big Data 2022; 8: 882-893, doi:10.1109/TBDATA.2021.3128906] used multi-view representation learning to cluster network data. Ma et al. [18: IEEE/ACM Trans. Comput. Biol. Bioinf. 2022; 19: 305-316, doi:10.1109/TCBB.2020.3004808] adopted matrix decomposition to integrate heterogeneous networks. Lin et al. [19: IEEE Access 2020; 8: 197463-197472, doi:10.1109/ACCESS.2020.3034623] combined node2vec [20: Grover and Leskovec, KDD 2016: 855-864] and matrix factorization to analyze cancer attributed networks. DeepMNE-CNN [21: Peng et al., Briefings Bioinf. 2021; 22: 2096-2105, doi:10.1093/bib/bbaa036] is a semi-supervised autoencoder method that integrates RWR-derived embeddings from multiple networks and predicts gene function with a convolutional network. Graph neural networks (GNNs) have recently emerged to incorporate graph structure into deep learning frameworks [22: Defferrard et al., NeurIPS 2016]. Representing genes as nodes and their interactions as edges, a GNN naturally captures the interdependencies of molecules within networks, and node embeddings are learned by iteratively aggregating information from adjacent neighbors. Depending on how information is propagated, GNN architectures include graph convolutional networks (GCNs) [23: Kipf and Welling, arXiv 2016, doi:10.48550/arXiv.1609.02907], GraphSAGE [24: Hamilton et al., NeurIPS 2017], GAT [25: Veličković et al., arXiv 2017, doi:10.48550/arXiv.1710.10903], GIN [26: Xu et al., arXiv 2018, doi:10.48550/arXiv.1810.00826], etc. In recent years, GNNs have demonstrated effectiveness in biological tasks such as predicting drug-target interactions [27: Torng and Altman, J. Chem. Inf. Model. 2019; 59: 4131-4149, doi:10.1021/acs.jcim.9b00628] and disease identification [28: Xu et al., BMC Bioinf. 2020; 21: 504, doi:10.1186/s12859-020-03847-1]. For example, EMOGI [29: Schulte-Sasse et al., Nat. Mach. Intell. 2021; 3: 513-526, doi:10.1038/s42256-021-00325-y] leverages GCNs to integrate topological features from PPI networks with multi-omics pan-cancer data to propose novel cancer genes. Furthermore, multimodal GNNs incorporating more than one type of node enable multi-relational link prediction. Decagon [30: Zitnik et al., Bioinformatics 2018; 34: i457-i466, doi:10.1093/bioinformatics/bty294] constructed a heterogeneous gene-drug network to predict polypharmacy side effects by decoding links between drug pairs. Self-supervised learning (SSL) has recently provided a promising paradigm toward human-level intelligence and achieved great success in natural language processing and computer vision, for example BERT [31: Devlin et al., arXiv 2018, doi:10.48550/arXiv.1810.04805], SimCLR [32: Chen et al., PMLR 2020: 1597-1607], and MAE [33: He et al., CVPR 2022: 16000-16009]. SSL first pre-trains a model on a well-designed pretext task, then fine-tunes it on a specific downstream task of interest. Biological networks contain tremendous intrinsic information, and applying SSL to network biology shows promise for learning directly from interacting biological molecules. Owing to the non-Euclidean data structure, graph SSL has the particular characteristic that pre-training can be implemented at the level of individual nodes and of entire graphs, deriving useful local and global representations simultaneously [34: Hu et al., arXiv 2019, doi:10.48550/arXiv.1905.12265]. A recent survey [35: Liu et al., IEEE Trans. Knowl. Data Eng. 2022] divides pre-training tasks into four categories: generative, contrastive, and auxiliary property-based, as well as their hybridizations. Avoiding negative transfer from the pre-training task to downstream objectives is a key consideration for self-supervised graph representation learning [36: Rosenstein et al., NIPS 2005]. Inspired by the recent progress of self-supervised GNNs [34], we propose Graphene, a two-step graph representation learning method for gene function analysis.
We first integrate multiple molecular networks and pre-train a GCN to derive initial embeddings for each gene or protein. We then re-train the network with a GAT architecture and achieve state-of-the-art performance in recovering pathway and disease genes. Integration is done simply by taking the union of edges from the different networks after aligning node identities (see methods). The generalizability of gene embeddings learned from GWAS hits is tested directly on two other independently curated disease gene sets (DisGeNET [37: Piñero et al., Nucleic Acids Res. 2017; 45: D833-D839, doi:10.1093/nar/gkw943] and UK Biobank [38: McInnes et al., Bioinformatics 2019; 35: 2495-2497, doi:10.1093/bioinformatics/bty999]) without further model training. Tissue-specific patterns are recapitulated for a broad range of diseases. Reprioritized genes show biologically relevant functional enrichment in related pathways. We also show that the attention weights between gene nodes learned by the GAT offer natural hints about regulatory relationships. Shared gene modules are identified among several common psychiatric disorders, offering functional evidence and recapitulating previous mechanistic insights. In brief, we demonstrate that pre-training a GNN on molecular networks in a self-supervised manner provides strategic adaptability to a series of downstream tasks, including pathway gene recovery, disease gene prioritization, module identification, and comorbidity validation.
Prioritizing disease-related markers can also benefit from explicitly adding disease nodes. For example, Zeng et al. [39: Briefings Bioinf. 2016; 17: 193-203, doi:10.1093/bib/bbv033] integrated a microRNA network and a disease phenotype network to prioritize disease-relevant microRNAs. In the comorbidity prediction task, we likewise demonstrate how to incorporate disease nodes to build a heterogeneous GNN, add a decoder function, and re-train the network, which achieves superior accuracy. As shown in Figure 1A, we use four molecular networks to pre-train Graphene: 142 tissue-specific gene networks from HumanBase; a PPI network from STRING (9606, v11); a recently released systematic proteome-wide reference, the Human Reference Interactome (HuRI) [40: Luck et al., Nature 2020; 580: 402-408, doi:10.1038/s41586-020-2188-x]; and a well-integrated composite network, PCNet. These networks are combined by unifying their edges and nodes (see methods; all network datasets are in Table S1), resulting in a giant network comprising 19,324 gene nodes and 16,142,804 edges. We adopt node recovery and context prediction as the two pretext tasks for Graphene pre-training [34: Hu et al., arXiv 2019, doi:10.48550/arXiv.1905.12265] (methods).
In particular, we randomly mask 15% of nodes and predict the identities of the masked nodes from a transformation of their neighborhood representations, defined as a multi-class classification problem with a cross-entropy loss. For context prediction, the k-hop neighborhood contains all nodes within k hops of the center node. Nodes shared between the neighborhood and the context graph are referred to as context anchor nodes and provide the connectivity information between the neighborhood and context graphs. Negative sampling [41: Ying et al., KDD 2018: 974-983] is then used to jointly learn neighborhood- and context-graph-derived embeddings, casting the task as binary classification of whether a particular context graph and neighborhood belong to the same center node. These two auxiliary tasks enable integration of the four molecular networks in a self-supervised manner. We consider GCN and GAT as two pre-training architectures for aggregating neighborhood features. In our pre-training experiments, we find that GCN produces more flexible embeddings than GAT, which benefits the downstream re-training process. The embedding size is set to 100 and the number of GCN layers to 5. Drawing lessons from the previous report [34: Hu et al., arXiv 2019, doi:10.48550/arXiv.1905.12265], we pre-train Graphene for 100 epochs in around 150 h on one Tesla V100 GPU.
The downstream tasks of disease gene reprioritization and gene set member identification complete in about 300 s (1,000 epochs) on a Quadro RTX 6000 GPU (Table S2), which is much more efficient than competing methods. At the downstream re-training stage, we use all pre-trained node embeddings as model initialization and adopt two to three GAT layers to derive node embeddings for downstream tasks, owing to GAT’s faster convergence during re-training (see Table S2). These node representations are then fed into one multilayer-perceptron classification layer to predict node labels. We use Reactome [42: Fabregat et al., Nucleic Acids Res. 2018; 46: D649-D655, doi:10.1093/nar/gkx1132] and NCI [43: Schaefer et al., Nucleic Acids Res. 2009; 37: D674-D679, doi:10.1093/nar/gkn653] as validation datasets for the pathway gene set membership recovery task (Figure 1B). Only half of the nodes’ pathway labels are kept for training, and the remaining members are recovered for each pathway. We use the GWAS Catalog [44: Buniello et al., Nucleic Acids Res. 2019; 47: D1005-D1012, doi:10.1093/nar/gky1120], comprising 202 common diseases, as the training set for the disease-gene reprioritization task (Figure 1C).
Note that the re-training process for the disease gene prioritization task differs from the pathway member recovery setting in train-validation ratio and mask split (methods). DisGeNET and UK Biobank (171 disease nomenclatures aligned with GWAS for DisGeNET and 81 diseases for UK Biobank) are then used as hold-out test sets, without further model training, for independent cross-dataset evaluation. The genes re-ranked by Graphene can then be used for disease-relevant functional module identification and tissue-specificity analysis. We also construct a heterogeneous graph by explicitly adding disease nodes to explore the comorbidity relationship between disease pairs, where a decoder function is introduced to predict the edge labels between two disease nodes (Figure 1D). Detailed model architectures for each stage can be found in Figure S1, and illustrations of the Graphene implementation in the methods. Publicly available pathway gene sets related to certain biological processes contain abundant noise owing to the inherent nature of high-throughput experiments. We first sought to assess whether re-training Graphene could accurately denoise and recover pathway gene sets. Initialized with pre-trained embeddings, we use a two-layer GAT followed by one classification layer to learn domain-specific representations for the Reactome and NCI pathway gene sets. We adopt the same train-test ratio as Set2Gaussian, where only half of the membership labels are used in the re-training stage. Evaluated on the NCI dataset with the same metric (mean area under the precision-recall curve [mean AUPRC]), Graphene outperforms Set2Gaussian and a simple mean-pooling method across all three levels of pathway sets (mean AUPRC = 0.29, 0.31, and 0.29 for small (3–10), medium (11–30), and large (31–1,000) sets, respectively) (Figure 2A).
For comparison, we also train Graphene with random initial input embeddings and the same model architecture and obtain inferior performance; detailed comparison results can be found in Table S3. On the Reactome dataset, Graphene achieves a mean AUPRC of 0.58 and 0.69 for medium (11–30) and large (31–1,000) sets (Figure 2B), outperforming Set2Gaussian. Graphene’s GNN architecture effectively propagates information across the graph and facilitates knowledge transfer through the two-step training strategy. This task is run with five repetitions (Figure S5). As potential disease genes converge on interacting molecules in functional networks, we next apply Graphene to GWAS hits to examine how the integration of multiple networks and pre-training can benefit the decoding of gene-disease relationships. We collect association signals for 202 diseases downloaded from the GWAS Catalog and use 60% of the labels to re-train Graphene on the disease gene recovery task, which is compatible with the canonical GWAS workflow. NAGA [4: Carlin et al., iScience 2019; 16: 155-161, doi:10.1016/j.isci.2019.05.025], which uses RWR as its propagation scheme, together with GenePanda [45: Yin et al., Sci. Rep. 2017; 7: 43258, https:/" @default.
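The abstract describes network integration as taking the union of edges from several networks after aligning node identities. A minimal sketch of that step follows; the function name, identifier maps, and toy edges are illustrative assumptions, not the paper's code.

```python
def integrate_networks(edge_lists, id_maps):
    """Union the edges of several molecular networks after mapping
    every node to a shared gene identifier (illustrative sketch)."""
    merged = set()
    for edges, to_shared in zip(edge_lists, id_maps):
        for u, v in edges:
            a, b = to_shared.get(u), to_shared.get(v)
            if a is None or b is None or a == b:
                continue  # drop unmapped nodes and self-loops
            merged.add((min(a, b), max(a, b)))  # undirected, canonical order
    return merged

# toy example: two networks naming the same genes differently
ppi = [("TP53", "MDM2"), ("TP53", "EP300")]
coexpr = [("7157", "4193")]  # Entrez IDs for the same TP53-MDM2 pair
maps = [{"TP53": "7157", "MDM2": "4193", "EP300": "2033"},
        {"7157": "7157", "4193": "4193"}]
edges = integrate_networks([ppi, coexpr], maps)
print(len(edges))  # the duplicate TP53-MDM2 edge collapses -> 2
```

Canonicalizing each undirected edge as an ordered pair is what makes duplicate edges from different source networks merge in the set union.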
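The masked-node pretext task (mask 15% of nodes and predict their identities from aggregated neighborhood features with a cross-entropy loss) can be illustrated with a framework-free sketch. The mean-neighbor aggregation and linear decoder below are stand-ins for the paper's GCN layers, and the random graph is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim = 20, 8

# random symmetric adjacency standing in for the integrated network
adj = (rng.random((n_nodes, n_nodes)) < 0.3).astype(float)
adj = np.maximum(adj, adj.T)
np.fill_diagonal(adj, 0.0)

emb = rng.normal(size=(n_nodes, dim))   # learnable node embeddings
n_mask = max(1, int(0.15 * n_nodes))    # mask 15% of nodes
masked = rng.choice(n_nodes, size=n_mask, replace=False)
x = emb.copy()
x[masked] = 0.0                         # hide the masked nodes' features

# one round of mean-neighbor aggregation (stand-in for a GCN layer)
deg = adj.sum(axis=1, keepdims=True).clip(min=1.0)
h = adj @ x / deg

# linear decoder scores every node identity for each masked node
W = rng.normal(size=(dim, n_nodes))
logits = h[masked] @ W
log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_p[np.arange(n_mask), masked].mean()  # cross-entropy vs. true IDs
```

In training, gradients of this loss would update the embeddings and weights so that a node's neighborhood becomes predictive of its identity; here only a single forward pass is shown.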
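The mean AUPRC metric used for pathway gene recovery averages, over pathways, the area under the precision-recall curve for the ranking of held-out members. A minimal sketch using the average-precision estimator of AUPRC follows (the exact estimator used in the paper is an assumption; the scores and labels are toy values).

```python
def average_precision(scores, labels):
    """Average precision: mean of precision at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / max(1, sum(labels))

# toy ranking of candidate genes for one pathway (label 1 = held-out member)
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
print(round(average_precision(scores, labels), 3))  # -> 0.833
```

Averaging this quantity over all pathways in a size bin (small, medium, large) gives the mean AUPRC values reported for the NCI and Reactome comparisons.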
- W4310857078 created "2022-12-19" @default.
- W4310857078 creator A5029667848 @default.
- W4310857078 creator A5076728540 @default.
- W4310857078 creator A5077718741 @default.
- W4310857078 creator A5077851706 @default.
- W4310857078 creator A5083408391 @default.
- W4310857078 creator A5090433321 @default.
- W4310857078 date "2023-01-01" @default.
- W4310857078 modified "2023-10-16" @default.
- W4310857078 title "Self-supervised graph representation learning integrates multiple molecular networks and decodes gene-disease relationships" @default.
- W4310857078 cites W1014257459 @default.
- W4310857078 cites W1533942137 @default.
- W4310857078 cites W1639356345 @default.
- W4310857078 cites W1787224781 @default.
- W4310857078 cites W1966761076 @default.
- W4310857078 cites W1967232661 @default.
- W4310857078 cites W1970662525 @default.
- W4310857078 cites W1977223059 @default.
- W4310857078 cites W1981409633 @default.
- W4310857078 cites W1989277387 @default.
- W4310857078 cites W1994245977 @default.
- W4310857078 cites W2001101255 @default.
- W4310857078 cites W2007300748 @default.
- W4310857078 cites W2016125507 @default.
- W4310857078 cites W2029490031 @default.
- W4310857078 cites W2045186085 @default.
- W4310857078 cites W2051674484 @default.
- W4310857078 cites W2056782561 @default.
- W4310857078 cites W2057181320 @default.
- W4310857078 cites W2069281240 @default.
- W4310857078 cites W2089927164 @default.
- W4310857078 cites W2096173332 @default.
- W4310857078 cites W2096766502 @default.
- W4310857078 cites W2101357408 @default.
- W4310857078 cites W2104972295 @default.
- W4310857078 cites W2114893730 @default.
- W4310857078 cites W2116635624 @default.
- W4310857078 cites W2119412782 @default.
- W4310857078 cites W2132490020 @default.
- W4310857078 cites W2138014988 @default.
- W4310857078 cites W2139776339 @default.
- W4310857078 cites W2151014102 @default.
- W4310857078 cites W2157952220 @default.
- W4310857078 cites W2159482845 @default.
- W4310857078 cites W2165927010 @default.
- W4310857078 cites W2170621766 @default.
- W4310857078 cites W2170637386 @default.
- W4310857078 cites W2195783463 @default.
- W4310857078 cites W2231907166 @default.
- W4310857078 cites W2258129851 @default.
- W4310857078 cites W2321595698 @default.
- W4310857078 cites W2340656051 @default.
- W4310857078 cites W2537679995 @default.
- W4310857078 cites W2556519015 @default.
- W4310857078 cites W2571042874 @default.
- W4310857078 cites W2592729027 @default.
- W4310857078 cites W2609040378 @default.
- W4310857078 cites W2624021832 @default.
- W4310857078 cites W2786016794 @default.
- W4310857078 cites W2794479004 @default.
- W4310857078 cites W2804994040 @default.
- W4310857078 cites W2898850951 @default.
- W4310857078 cites W2901303766 @default.
- W4310857078 cites W2905428026 @default.
- W4310857078 cites W2905748319 @default.
- W4310857078 cites W2911487850 @default.
- W4310857078 cites W2935758675 @default.
- W4310857078 cites W2945749616 @default.
- W4310857078 cites W2949112115 @default.
- W4310857078 cites W2962756421 @default.
- W4310857078 cites W2970771711 @default.
- W4310857078 cites W2971586943 @default.
- W4310857078 cites W2978484973 @default.
- W4310857078 cites W2997029448 @default.
- W4310857078 cites W3014623641 @default.
- W4310857078 cites W3015964336 @default.
- W4310857078 cites W3035387075 @default.
- W4310857078 cites W3038607030 @default.
- W4310857078 cites W3041678484 @default.
- W4310857078 cites W3096539820 @default.
- W4310857078 cites W3108155943 @default.
- W4310857078 cites W3111074515 @default.
- W4310857078 cites W3112561745 @default.
- W4310857078 cites W3119797816 @default.
- W4310857078 cites W3124209384 @default.
- W4310857078 cites W3134771351 @default.
- W4310857078 cites W3156270925 @default.
- W4310857078 cites W3157950658 @default.
- W4310857078 cites W3189676628 @default.
- W4310857078 cites W3198493511 @default.
- W4310857078 cites W3199235896 @default.
- W4310857078 cites W3212549469 @default.
- W4310857078 cites W3213303594 @default.
- W4310857078 cites W4233698560 @default.
- W4310857078 cites W4245660285 @default.
- W4310857078 cites W4294216483 @default.
- W4310857078 doi "https://doi.org/10.1016/j.patter.2022.100651" @default.