Matches in SemOpenAlex for { <https://semopenalex.org/work/W3008859095> ?p ?o ?g. }
- W3008859095 abstract "Gene families are groups of genes that have descended from a common ancestral gene present in the species under study. Current, widely used gene family building algorithms can produce family clusters that may be fragmented or missing true family sequences (under-clustering). Here we present a classification method based on sequence pairs that, first, inspects given families for under-clustering and then predicts the missing sequences for the families using family-specific alignment score cutoffs. We have tested this method on a set of curated, gold-standard (“true”) families from the Yeast Gene Order Browser (YGOB) database, including 20 yeast species, as well as a test set of intentionally under-clustered (“deficient”) families derived from the YGOB families. For 83% of the modified yeast families, our pair-classification method was able to reliably detect under-clustering in “deficient” families that were missing 20% of sequences relative to the full (“true”) families. We also attempted to predict back the missing sequences using the family-specific alignment score cutoffs obtained during the detection phase. In the case of “pure” under-clustered families (under-clustered families with no “wrong”/unrelated sequences), for 78% of families the prediction precision and recall were ≥0.75, with mean precision = 0.928 and mean recall = 0.859. For “impure” under-clustered families (under-clustered families containing closest sequences from outside the family, in addition to missing true family sequences), the prediction precision and recall were ≥0.75 for 63% of families, with mean precision = 0.790 and mean recall = 0.869. To check if our method can detect and correct incomplete families obtained using existing family building methods, we attempted to correct 374 under-clustered yeast families produced using the OrthoFinder tool. We were able to predict missing sequences for at least 19 yeast families with mean precision of 0.9 and mean recall of 0.65. We also analyzed 14,663 legume families built using the OrthoFinder program, with 14 legume species. We were able to identify 1,665 OrthoFinder families that were missing one or more sequences, which were previously un-clustered or clustered into unusually small families. Further, using a simple merging strategy, we were able to merge 2,216 small families into 933 under-clustered families using the predicted missing sequences. Out of the 933 merged families, we could confirm correct mergings in at least 534 families using the maximum-likelihood phylogenies of the merged families. We also provide recommendations on different types of family-specific alignment score cutoffs that can be used for predicting the missing sequences based on the “purity” of under-clustered families and the chosen precision and recall for prediction. Finally, we provide a containerized version of the pair-classification method that can be applied to any given set of gene families." @default.
- W3008859095 created "2020-03-06" @default.
- W3008859095 creator A5007582587 @default.
- W3008859095 creator A5028155757 @default.
- W3008859095 creator A5070873690 @default.
- W3008859095 date "2020-02-23" @default.
- W3008859095 modified "2023-09-25" @default.
- W3008859095 title "A Sequence-Pair-Classification-Based Method for Detecting and Correcting Under-Clustered Gene Families" @default.
- W3008859095 cites W1900937478 @default.
- W3008859095 cites W1933357484 @default.
- W3008859095 cites W1964602476 @default.
- W3008859095 cites W1964612763 @default.
- W3008859095 cites W1972598615 @default.
- W3008859095 cites W1973408460 @default.
- W3008859095 cites W1990453950 @default.
- W3008859095 cites W2008856488 @default.
- W3008859095 cites W2010562878 @default.
- W3008859095 cites W2012986985 @default.
- W3008859095 cites W2019851585 @default.
- W3008859095 cites W2021044098 @default.
- W3008859095 cites W2023218781 @default.
- W3008859095 cites W2030317329 @default.
- W3008859095 cites W2041294499 @default.
- W3008859095 cites W2041624900 @default.
- W3008859095 cites W2047957449 @default.
- W3008859095 cites W2049131915 @default.
- W3008859095 cites W2064906168 @default.
- W3008859095 cites W2066760256 @default.
- W3008859095 cites W2067513135 @default.
- W3008859095 cites W2074363252 @default.
- W3008859095 cites W207833169 @default.
- W3008859095 cites W2094334461 @default.
- W3008859095 cites W2122821321 @default.
- W3008859095 cites W2124166542 @default.
- W3008859095 cites W2127847431 @default.
- W3008859095 cites W2128605601 @default.
- W3008859095 cites W2137084536 @default.
- W3008859095 cites W2155651473 @default.
- W3008859095 cites W2158714788 @default.
- W3008859095 cites W2159266221 @default.
- W3008859095 cites W2159770366 @default.
- W3008859095 cites W2160378127 @default.
- W3008859095 cites W2163376857 @default.
- W3008859095 cites W2341605457 @default.
- W3008859095 cites W2508114195 @default.
- W3008859095 cites W3147254695 @default.
- W3008859095 cites W319819633 @default.
- W3008859095 cites W4236236547 @default.
- W3008859095 doi "https://doi.org/10.1101/2020.02.22.942557" @default.
- W3008859095 hasPublicationYear "2020" @default.
- W3008859095 type Work @default.
- W3008859095 sameAs 3008859095 @default.
- W3008859095 citedByCount "0" @default.
- W3008859095 crossrefType "posted-content" @default.
- W3008859095 hasAuthorship W3008859095A5007582587 @default.
- W3008859095 hasAuthorship W3008859095A5028155757 @default.
- W3008859095 hasAuthorship W3008859095A5070873690 @default.
- W3008859095 hasBestOaLocation W30088590951 @default.
- W3008859095 hasConcept C100660578 @default.
- W3008859095 hasConcept C104317684 @default.
- W3008859095 hasConcept C105795698 @default.
- W3008859095 hasConcept C141231307 @default.
- W3008859095 hasConcept C154945302 @default.
- W3008859095 hasConcept C15744967 @default.
- W3008859095 hasConcept C177264268 @default.
- W3008859095 hasConcept C180747234 @default.
- W3008859095 hasConcept C199360897 @default.
- W3008859095 hasConcept C2778112365 @default.
- W3008859095 hasConcept C33923547 @default.
- W3008859095 hasConcept C41008148 @default.
- W3008859095 hasConcept C54355233 @default.
- W3008859095 hasConcept C5911399 @default.
- W3008859095 hasConcept C70721500 @default.
- W3008859095 hasConcept C73555534 @default.
- W3008859095 hasConcept C81669768 @default.
- W3008859095 hasConcept C86803240 @default.
- W3008859095 hasConcept C9357733 @default.
- W3008859095 hasConceptScore W3008859095C100660578 @default.
- W3008859095 hasConceptScore W3008859095C104317684 @default.
- W3008859095 hasConceptScore W3008859095C105795698 @default.
- W3008859095 hasConceptScore W3008859095C141231307 @default.
- W3008859095 hasConceptScore W3008859095C154945302 @default.
- W3008859095 hasConceptScore W3008859095C15744967 @default.
- W3008859095 hasConceptScore W3008859095C177264268 @default.
- W3008859095 hasConceptScore W3008859095C180747234 @default.
- W3008859095 hasConceptScore W3008859095C199360897 @default.
- W3008859095 hasConceptScore W3008859095C2778112365 @default.
- W3008859095 hasConceptScore W3008859095C33923547 @default.
- W3008859095 hasConceptScore W3008859095C41008148 @default.
- W3008859095 hasConceptScore W3008859095C54355233 @default.
- W3008859095 hasConceptScore W3008859095C5911399 @default.
- W3008859095 hasConceptScore W3008859095C70721500 @default.
- W3008859095 hasConceptScore W3008859095C73555534 @default.
- W3008859095 hasConceptScore W3008859095C81669768 @default.
- W3008859095 hasConceptScore W3008859095C86803240 @default.
- W3008859095 hasConceptScore W3008859095C9357733 @default.
- W3008859095 hasLocation W30088590951 @default.
- W3008859095 hasOpenAccess W3008859095 @default.
- W3008859095 hasPrimaryLocation W30088590951 @default.
- W3008859095 hasRelatedWork W10178904 @default.
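
The abstract above reports precision and recall for missing sequences predicted with a family-specific alignment score cutoff. The following is a minimal sketch of how such figures could be computed for a single family. It is not the authors' implementation: the function names, the sequence IDs, the scores, and the cutoff value of 200.0 are all hypothetical, and the "score at or above cutoff" rule is an assumed simplification of the method.

```python
from typing import Dict, Set, Tuple


def predict_missing(scores: Dict[str, float], cutoff: float) -> Set[str]:
    """Return candidate IDs whose alignment score meets the family cutoff.

    `scores` maps a candidate sequence ID to its (hypothetical) best
    alignment score against the family; `cutoff` is the family-specific
    threshold. Both are illustrative assumptions.
    """
    return {seq_id for seq_id, score in scores.items() if score >= cutoff}


def precision_recall(predicted: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall of predicted vs. truly missing sequences."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data for one under-clustered family.
candidate_scores = {"seqA": 310.0, "seqB": 275.5, "seqC": 120.0, "seqD": 95.0}
true_missing = {"seqA", "seqB", "seqC"}

predicted = predict_missing(candidate_scores, cutoff=200.0)
precision, recall = precision_recall(predicted, true_missing)
print(f"predicted={sorted(predicted)} precision={precision:.3f} recall={recall:.3f}")
```

In this toy case a stricter cutoff keeps precision high at the cost of recall, which is the trade-off the abstract refers to when it recommends choosing cutoffs according to the desired precision and recall.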