Matches in SemOpenAlex for { <https://semopenalex.org/work/W2912120090> ?p ?o ?g. }
- W2912120090 abstract "Background: The principal objective of comparative genomics is inferring attributes of an unknown gene by comparing it with well-studied genes. In this regard, identifying orthologous genes plays a pivotal role, as orthologous genes remain less diverged in the course of evolution. However, identifying orthologous genes is often difficult, slow, and idiosyncratic, especially in the presence of multiple domains in proteins, evolutionary dynamics (gene duplication, transfer, loss, introgression, etc.), multiple paralogous genes, incomplete genome data, and for distantly related species where similarity is hard to recognize. Motivation: Advances in identifying orthologs have mostly been constrained to developing databases of genes or methods which involve computationally expensive BLAST searches or constructing phylogenetic trees to infer orthologous relationships. These methods do not generally scale well and cannot analyze large amounts of data from diverse organisms with high accuracy. Moreover, most of these methods involve manual parameter tuning, and hence are neither fully automated nor free from human bias. Results: We present NORTH, a novel, automated, highly accurate and scalable machine learning based orthologous gene clustering method. We have utilized the biological basis and intuition of orthologous genes and made an effort to incorporate appropriate ideas from machine learning (ML) and natural language processing (NLP). We have discovered that the BLAST search based protocols deeply resemble a “text classification” problem. Thus, we employ the robust bag-of-words model accompanied by a Naive Bayes classifier to cluster the orthologous genes. We studied 1,255,877 genes in the largest 250 ortholog clusters from the KEGG database, across 3,880 organisms comprising the six major groups of life, namely, Archaea, Bacteria, Animals, Fungi, Plants and Protists. 
Despite covering more than a million genes from distantly related species with acute data imbalance, NORTH is able to cluster them with 98.48% Precision, 98.43% Recall and 98.44% F1 score, showing that automatic orthologous gene clustering can be both highly accurate and scalable. NORTH is available as a web interface with a server side application, along with cross-platform native applications (available at https://nibtehaz.github.io/NORTH/), allowing queries based on individual genes." @default.
- W2912120090 created "2019-02-21" @default.
- W2912120090 creator A5032727078 @default.
- W2912120090 creator A5038924256 @default.
- W2912120090 creator A5040511403 @default.
- W2912120090 creator A5058915785 @default.
- W2912120090 creator A5090371713 @default.
- W2912120090 date "2019-01-23" @default.
- W2912120090 modified "2023-09-24" @default.
- W2912120090 title "NORTH: a highly accurate and scalable Naive Bayes based ORTHologous gene clustering algorithm" @default.
- W2912120090 cites W130881599 @default.
- W2912120090 cites W1484660603 @default.
- W2912120090 cites W1520812622 @default.
- W2912120090 cites W1552559843 @default.
- W2912120090 cites W1565201084 @default.
- W2912120090 cites W1900937478 @default.
- W2912120090 cites W1930624869 @default.
- W2912120090 cites W1979835060 @default.
- W2912120090 cites W1980867644 @default.
- W2912120090 cites W2007447692 @default.
- W2912120090 cites W2008856488 @default.
- W2912120090 cites W2014545475 @default.
- W2912120090 cites W2030317329 @default.
- W2912120090 cites W2041788773 @default.
- W2912120090 cites W2041837226 @default.
- W2912120090 cites W2044892321 @default.
- W2912120090 cites W2066593167 @default.
- W2912120090 cites W2080170974 @default.
- W2912120090 cites W2095649738 @default.
- W2912120090 cites W2099321785 @default.
- W2912120090 cites W2104558151 @default.
- W2912120090 cites W2107954630 @default.
- W2912120090 cites W2111229238 @default.
- W2912120090 cites W2114606763 @default.
- W2912120090 cites W2118978333 @default.
- W2912120090 cites W2135281627 @default.
- W2912120090 cites W2136258184 @default.
- W2912120090 cites W2137084536 @default.
- W2912120090 cites W2137464714 @default.
- W2912120090 cites W2138952930 @default.
- W2912120090 cites W2140031146 @default.
- W2912120090 cites W2140161828 @default.
- W2912120090 cites W2143335572 @default.
- W2912120090 cites W2145100181 @default.
- W2912120090 cites W2151859071 @default.
- W2912120090 cites W2154044807 @default.
- W2912120090 cites W2154139219 @default.
- W2912120090 cites W2159266221 @default.
- W2912120090 cites W2159482845 @default.
- W2912120090 cites W2159500193 @default.
- W2912120090 cites W2168048337 @default.
- W2912120090 cites W2170252811 @default.
- W2912120090 cites W2180806933 @default.
- W2912120090 cites W2280625941 @default.
- W2912120090 cites W2316262244 @default.
- W2912120090 cites W2549006896 @default.
- W2912120090 cites W2568815643 @default.
- W2912120090 cites W319819633 @default.
- W2912120090 cites W4236236547 @default.
- W2912120090 cites W4241931738 @default.
- W2912120090 cites W93588716 @default.
- W2912120090 doi "https://doi.org/10.1101/528323" @default.
- W2912120090 hasPublicationYear "2019" @default.
- W2912120090 type Work @default.
- W2912120090 sameAs 2912120090 @default.
- W2912120090 citedByCount "0" @default.
- W2912120090 crossrefType "posted-content" @default.
- W2912120090 hasAuthorship W2912120090A5032727078 @default.
- W2912120090 hasAuthorship W2912120090A5038924256 @default.
- W2912120090 hasAuthorship W2912120090A5040511403 @default.
- W2912120090 hasAuthorship W2912120090A5058915785 @default.
- W2912120090 hasAuthorship W2912120090A5090371713 @default.
- W2912120090 hasBestOaLocation W29121200901 @default.
- W2912120090 hasConcept C104317684 @default.
- W2912120090 hasConcept C107673813 @default.
- W2912120090 hasConcept C119857082 @default.
- W2912120090 hasConcept C12267149 @default.
- W2912120090 hasConcept C141231307 @default.
- W2912120090 hasConcept C154945302 @default.
- W2912120090 hasConcept C189206191 @default.
- W2912120090 hasConcept C193252679 @default.
- W2912120090 hasConcept C207201462 @default.
- W2912120090 hasConcept C2909108991 @default.
- W2912120090 hasConcept C41008148 @default.
- W2912120090 hasConcept C48044578 @default.
- W2912120090 hasConcept C52001869 @default.
- W2912120090 hasConcept C54355233 @default.
- W2912120090 hasConcept C70721500 @default.
- W2912120090 hasConcept C73555534 @default.
- W2912120090 hasConcept C77088390 @default.
- W2912120090 hasConcept C86803240 @default.
- W2912120090 hasConceptScore W2912120090C104317684 @default.
- W2912120090 hasConceptScore W2912120090C107673813 @default.
- W2912120090 hasConceptScore W2912120090C119857082 @default.
- W2912120090 hasConceptScore W2912120090C12267149 @default.
- W2912120090 hasConceptScore W2912120090C141231307 @default.
- W2912120090 hasConceptScore W2912120090C154945302 @default.
- W2912120090 hasConceptScore W2912120090C189206191 @default.
- W2912120090 hasConceptScore W2912120090C193252679 @default.
- W2912120090 hasConceptScore W2912120090C207201462 @default.