Matches in SemOpenAlex for { <https://semopenalex.org/work/W4366991376> ?p ?o ?g. }
- W4366991376 endingPage "1337" @default.
- W4366991376 startingPage "1337" @default.
- W4366991376 abstract "Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance include handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). Elaboration on performance is provided over protein fitness, protein size, and sampling techniques. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling while encoding sequences with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased the predictive performance of the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was rigorous enough in stability prediction (F1-score = 92%)." @default.
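The abstract's ranking step (MCDA; TOPSIS with entropy weighting) follows a standard recipe that can be sketched briefly. The code below is a minimal NumPy illustration, not the paper's implementation: the candidate pipelines, metric columns, and scores in the decision matrix are hypothetical placeholders, and all criteria are assumed to be benefit-type (higher is better).

```python
# Hedged sketch: entropy-weighted TOPSIS ranking, as named in the abstract.
# Row/column contents below are hypothetical, not taken from the paper.
import numpy as np

def entropy_weights(X):
    """Objective criterion weights from the Shannon entropy of each column."""
    P = X / X.sum(axis=0)                      # column-wise proportions
    m = X.shape[0]
    plogp = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
    e = -plogp.sum(axis=0) / np.log(m)         # entropy per criterion
    d = 1.0 - e                                # degree of divergence
    return d / d.sum()

def topsis(X, w):
    """Closeness coefficient in [0, 1]; all criteria assumed benefit-type."""
    R = X / np.linalg.norm(X, axis=0)          # vector normalization
    V = R * w                                  # weighted normalized matrix
    best, worst = V.max(axis=0), V.min(axis=0)
    d_pos = np.linalg.norm(V - best, axis=1)   # distance to ideal solution
    d_neg = np.linalg.norm(V - worst, axis=1)  # distance to anti-ideal
    return d_neg / (d_pos + d_neg)

# Rows: candidate pipelines; columns: imbalance-aware metrics
# (e.g. F1, MCC, AUPRC) -- values are illustrative only.
X = np.array([
    [0.97, 0.90, 0.95],   # ensemble of representations
    [0.93, 0.85, 0.91],   # ESM alone
    [0.88, 0.78, 0.84],   # One-Hot alone
])
w = entropy_weights(X)
scores = topsis(X, w)
ranking = np.argsort(-scores)   # indices of methods, best first
```

Entropy weighting makes the criterion weights data-driven: a metric whose values barely differ across candidates carries little discriminating information and receives a small weight, which is why it pairs naturally with the multiple imbalance-suited metrics the abstract mentions.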
- W4366991376 created "2023-04-27" @default.
- W4366991376 creator A5006849516 @default.
- W4366991376 creator A5045731858 @default.
- W4366991376 date "2023-04-25" @default.
- W4366991376 modified "2023-10-14" @default.
- W4366991376 title "Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods" @default.
- W4366991376 cites W1983558567 @default.
- W4366991376 cites W1995875735 @default.
- W4366991376 cites W2010356367 @default.
- W4366991376 cites W2018003897 @default.
- W4366991376 cites W2028415754 @default.
- W4366991376 cites W2055110202 @default.
- W4366991376 cites W2064476819 @default.
- W4366991376 cites W2071072418 @default.
- W4366991376 cites W2076646346 @default.
- W4366991376 cites W2079761222 @default.
- W4366991376 cites W2118265845 @default.
- W4366991376 cites W2141694027 @default.
- W4366991376 cites W2144588155 @default.
- W4366991376 cites W2144794655 @default.
- W4366991376 cites W2490420619 @default.
- W4366991376 cites W2515146973 @default.
- W4366991376 cites W2592740410 @default.
- W4366991376 cites W2611281850 @default.
- W4366991376 cites W2739999456 @default.
- W4366991376 cites W2791476909 @default.
- W4366991376 cites W2791796577 @default.
- W4366991376 cites W2800788706 @default.
- W4366991376 cites W2884001105 @default.
- W4366991376 cites W2889326414 @default.
- W4366991376 cites W2917580301 @default.
- W4366991376 cites W2949676527 @default.
- W4366991376 cites W2956530858 @default.
- W4366991376 cites W2971227267 @default.
- W4366991376 cites W2980789587 @default.
- W4366991376 cites W2990580840 @default.
- W4366991376 cites W2995514860 @default.
- W4366991376 cites W3000982932 @default.
- W4366991376 cites W3041568739 @default.
- W4366991376 cites W3087224093 @default.
- W4366991376 cites W3102961474 @default.
- W4366991376 cites W3114444973 @default.
- W4366991376 cites W3116452748 @default.
- W4366991376 cites W3123226037 @default.
- W4366991376 cites W3129073614 @default.
- W4366991376 cites W3130189352 @default.
- W4366991376 cites W3131975258 @default.
- W4366991376 cites W3139654928 @default.
- W4366991376 cites W3143378422 @default.
- W4366991376 cites W3144239152 @default.
- W4366991376 cites W3144701084 @default.
- W4366991376 cites W3146944767 @default.
- W4366991376 cites W3161782229 @default.
- W4366991376 cites W3164031528 @default.
- W4366991376 cites W3164947024 @default.
- W4366991376 cites W3177500196 @default.
- W4366991376 cites W3194729882 @default.
- W4366991376 cites W3197226962 @default.
- W4366991376 cites W3216588755 @default.
- W4366991376 cites W3216617870 @default.
- W4366991376 cites W38708963 @default.
- W4366991376 cites W4205773061 @default.
- W4366991376 cites W4210861939 @default.
- W4366991376 cites W4211075026 @default.
- W4366991376 cites W4225868104 @default.
- W4366991376 cites W4230429131 @default.
- W4366991376 cites W4281717669 @default.
- W4366991376 cites W4288066876 @default.
- W4366991376 cites W4307068738 @default.
- W4366991376 cites W4327550249 @default.
- W4366991376 doi "https://doi.org/10.3390/pharmaceutics15051337" @default.
- W4366991376 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/37242577" @default.
- W4366991376 hasPublicationYear "2023" @default.
- W4366991376 type Work @default.
- W4366991376 citedByCount "1" @default.
- W4366991376 crossrefType "journal-article" @default.
- W4366991376 hasAuthorship W4366991376A5006849516 @default.
- W4366991376 hasAuthorship W4366991376A5045731858 @default.
- W4366991376 hasBestOaLocation W43669913761 @default.
- W4366991376 hasConcept C106131492 @default.
- W4366991376 hasConcept C119857082 @default.
- W4366991376 hasConcept C124101348 @default.
- W4366991376 hasConcept C126838900 @default.
- W4366991376 hasConcept C136536468 @default.
- W4366991376 hasConcept C140779682 @default.
- W4366991376 hasConcept C151730666 @default.
- W4366991376 hasConcept C154945302 @default.
- W4366991376 hasConcept C183115368 @default.
- W4366991376 hasConcept C189430467 @default.
- W4366991376 hasConcept C197323446 @default.
- W4366991376 hasConcept C2776257435 @default.
- W4366991376 hasConcept C2779343474 @default.
- W4366991376 hasConcept C31258907 @default.
- W4366991376 hasConcept C31972630 @default.
- W4366991376 hasConcept C38652104 @default.
- W4366991376 hasConcept C41008148 @default.
- W4366991376 hasConcept C45942800 @default.