Matches in SemOpenAlex for { <https://semopenalex.org/work/W3088395980> ?p ?o ?g. }
Showing items 1 to 77 of 77, with 100 items per page.
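
The header above shows the basic graph pattern this result page answers. As a minimal sketch of how the same property listing could be fetched programmatically (assumptions: the public SemOpenAlex SPARQL endpoint at https://semopenalex.org/sparql accepts standard SPARQL-protocol GET requests, and the Python `requests` library is available):

```python
# Hypothetical sketch: query SemOpenAlex for all properties of one work.
import requests

ENDPOINT = "https://semopenalex.org/sparql"  # assumed public endpoint
QUERY = """
SELECT ?p ?o WHERE {
  <https://semopenalex.org/work/W3088395980> ?p ?o .
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()

# Print each predicate/object pair, mirroring the listing below.
for binding in response.json()["results"]["bindings"]:
    print(binding["p"]["value"], binding["o"]["value"])
```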
- W3088395980 abstract "Many of the machine learning (ML) models used in bioinformatics and computational biology to predict protein function or structure rely on evolutionary information as summarized in multiple-sequence alignments (MSAs) or the resulting position-specific scoring matrices (PSSMs), as generated by PSI-BLAST. Due to the exhaustive database search needed to retrieve this evolutionary information, the current procedure used in protein structure and function prediction is computationally expensive and time-consuming. The issue is becoming more problematic as protein sequence databases grow exponentially over time, raising PSI-BLAST runtime with them. According to our previous experiments, PSI-BLAST takes on average 14 minutes to build the PSSM profile for a single query protein. Therefore, to build even a simple ML model, one may have to wait a couple of months for PSI-BLAST to generate PSSMs for a few thousand query proteins. This runtime bottleneck is a major problem that requires an efficient alternative solution. A protein sequence is a collection of contiguous tokens, or characters, called amino acids (AAs). This analogy to natural language allows us to exploit recent advancements in Natural Language Processing (NLP) and transfer state-of-the-art NLP algorithms to bioinformatics. A prominent recent alternative to PSSMs as input to prediction methods is Embeddings from Language Models (ELMo), which converts a protein sequence into a numerical vector representation. ELMo/SeqVec is a state-of-the-art pre-trained deep learning model that embeds a protein sequence into a 3-dimensional tensor of numerical values. This ELMo trained a 2-layer bidirectional Long Short-Term Memory (LSTM) network, following a two-path architecture (one for the forward and one for the backward pass), on the unsupervised task of predicting the next AA from the previously seen residues in the sequence. The performance of the embedder was then evaluated on downstream tasks such as secondary structure and subcellular localization prediction. The results showed that the embeddings succeed in capturing the biochemical and biophysical properties of a protein, but do not achieve state-of-the-art performance. By merging the idea of PSSMs with the concept of transfer learning during pre-training, we aim to deploy a new ELMo with better embedding power than SeqVec by training a novel single-branch bidirectional language model (bi-LM) with four times fewer free parameters. This is the first time an ELMo is trained not only to predict the next AA but also, simultaneously, the probability distribution of the next AA derived from similar yet different sequences as summarized in a PSSM (multi-task training), hence also learning the evolutionary information of protein sequences. To train our novel embedder, we compiled the largest currently curated dataset of sequences with their corresponding PSSMs: 1.83 million proteins (~0.8 billion amino acids). The dataset is redundancy-reduced to 40% sequence identity with respect to the validation/test sets, and contains sequences ranging between 18 and 9858 residues in length." @default.
- W3088395980 created "2020-10-01" @default.
- W3088395980 creator A5077123521 @default.
- W3088395980 date "2020-01-01" @default.
- W3088395980 modified "2023-09-27" @default.
- W3088395980 title "Variational Inference to Learn Representations for Protein Evolutionary Information" @default.
- W3088395980 hasPublicationYear "2020" @default.
- W3088395980 type Work @default.
- W3088395980 sameAs 3088395980 @default.
- W3088395980 citedByCount "0" @default.
- W3088395980 crossrefType "journal-article" @default.
- W3088395980 hasAuthorship W3088395980A5077123521 @default.
- W3088395980 hasConcept C119857082 @default.
- W3088395980 hasConcept C14036430 @default.
- W3088395980 hasConcept C149635348 @default.
- W3088395980 hasConcept C154945302 @default.
- W3088395980 hasConcept C17744445 @default.
- W3088395980 hasConcept C199539241 @default.
- W3088395980 hasConcept C202444582 @default.
- W3088395980 hasConcept C2776214188 @default.
- W3088395980 hasConcept C2776359362 @default.
- W3088395980 hasConcept C2778112365 @default.
- W3088395980 hasConcept C2780513914 @default.
- W3088395980 hasConcept C33923547 @default.
- W3088395980 hasConcept C41008148 @default.
- W3088395980 hasConcept C54355233 @default.
- W3088395980 hasConcept C78458016 @default.
- W3088395980 hasConcept C80444323 @default.
- W3088395980 hasConcept C86803240 @default.
- W3088395980 hasConcept C94625758 @default.
- W3088395980 hasConcept C9652623 @default.
- W3088395980 hasConceptScore W3088395980C119857082 @default.
- W3088395980 hasConceptScore W3088395980C14036430 @default.
- W3088395980 hasConceptScore W3088395980C149635348 @default.
- W3088395980 hasConceptScore W3088395980C154945302 @default.
- W3088395980 hasConceptScore W3088395980C17744445 @default.
- W3088395980 hasConceptScore W3088395980C199539241 @default.
- W3088395980 hasConceptScore W3088395980C202444582 @default.
- W3088395980 hasConceptScore W3088395980C2776214188 @default.
- W3088395980 hasConceptScore W3088395980C2776359362 @default.
- W3088395980 hasConceptScore W3088395980C2778112365 @default.
- W3088395980 hasConceptScore W3088395980C2780513914 @default.
- W3088395980 hasConceptScore W3088395980C33923547 @default.
- W3088395980 hasConceptScore W3088395980C41008148 @default.
- W3088395980 hasConceptScore W3088395980C54355233 @default.
- W3088395980 hasConceptScore W3088395980C78458016 @default.
- W3088395980 hasConceptScore W3088395980C80444323 @default.
- W3088395980 hasConceptScore W3088395980C86803240 @default.
- W3088395980 hasConceptScore W3088395980C94625758 @default.
- W3088395980 hasConceptScore W3088395980C9652623 @default.
- W3088395980 hasLocation W30883959801 @default.
- W3088395980 hasOpenAccess W3088395980 @default.
- W3088395980 hasPrimaryLocation W30883959801 @default.
- W3088395980 hasRelatedWork W1496233740 @default.
- W3088395980 hasRelatedWork W1513589390 @default.
- W3088395980 hasRelatedWork W1853093403 @default.
- W3088395980 hasRelatedWork W2034115630 @default.
- W3088395980 hasRelatedWork W2036177564 @default.
- W3088395980 hasRelatedWork W2057447180 @default.
- W3088395980 hasRelatedWork W2112815320 @default.
- W3088395980 hasRelatedWork W2156226201 @default.
- W3088395980 hasRelatedWork W2222511249 @default.
- W3088395980 hasRelatedWork W2429850869 @default.
- W3088395980 hasRelatedWork W2809485291 @default.
- W3088395980 hasRelatedWork W2899117813 @default.
- W3088395980 hasRelatedWork W2920759733 @default.
- W3088395980 hasRelatedWork W3105738938 @default.
- W3088395980 hasRelatedWork W3107443535 @default.
- W3088395980 hasRelatedWork W3109006384 @default.
- W3088395980 hasRelatedWork W3167641553 @default.
- W3088395980 hasRelatedWork W3186814059 @default.
- W3088395980 hasRelatedWork W3208818094 @default.
- W3088395980 hasRelatedWork W80262151 @default.
- W3088395980 isParatext "false" @default.
- W3088395980 isRetracted "false" @default.
- W3088395980 magId "3088395980" @default.
- W3088395980 workType "article" @default.
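
The abstract above describes a multi-task pre-training objective: a single-branch bi-LM that simultaneously predicts the next amino acid and the PSSM-derived probability distribution over it. Below is a minimal sketch of how such a joint loss could be wired, assuming PyTorch; all names (`MultiTaskProtLM`, `multitask_loss`, `alpha`, the dimensions) are hypothetical illustrations, not taken from the paper, and the actual architecture, loss weighting, and PSSM normalization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_AA = 20      # standard amino-acid alphabet (assumed; the paper may use extra symbols)
EMBED_DIM = 64
HIDDEN_DIM = 512

class MultiTaskProtLM(nn.Module):
    """Shared LSTM trunk with two heads: one predicts the next amino acid,
    the other predicts the PSSM-derived distribution for that position.
    A forward-only LSTM stands in for the paper's single-branch bi-LM here;
    a true bi-LM must offset the backward direction so it never sees the
    residue being predicted."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_AA, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=2, batch_first=True)
        self.aa_head = nn.Linear(HIDDEN_DIM, NUM_AA)    # next-residue logits
        self.pssm_head = nn.Linear(HIDDEN_DIM, NUM_AA)  # PSSM-distribution logits

    def forward(self, tokens):                  # tokens: (batch, length)
        h, _ = self.lstm(self.embed(tokens))    # h: (batch, length, HIDDEN_DIM)
        return self.aa_head(h), self.pssm_head(h)

def multitask_loss(model, tokens, next_tokens, pssm_rows, alpha=0.5):
    """Joint loss: cross-entropy against the observed next residue, plus
    KL divergence against the PSSM row (per-position probabilities of
    shape (batch, length, NUM_AA)) for the same position."""
    aa_logits, pssm_logits = model(tokens)
    ce = F.cross_entropy(aa_logits.transpose(1, 2), next_tokens)
    kl = F.kl_div(F.log_softmax(pssm_logits, dim=-1), pssm_rows,
                  reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kl
```

The `alpha` weight balancing the two tasks is an assumed knob; the abstract only states that the two objectives are trained simultaneously, not how they are combined.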