Matches in SemOpenAlex for { <https://semopenalex.org/work/W4206667478> ?p ?o ?g. }
- W4206667478 abstract "Background: Sequencing technologies are prone to errors, making error correction (EC) necessary for downstream applications. EC tools need to be manually configured for optimal performance. We find that the optimal parameters (e.g., k-mer size) are both tool- and dataset-dependent. Moreover, evaluating the performance (i.e., Alignment-rate or Gain) of a given tool usually relies on a reference genome, but quality reference genomes are not always available. We introduce Lerna for the automated configuration of k-mer-based EC tools. Lerna first creates a language model (LM) of the uncorrected genomic reads, and then, based on this LM, calculates a metric called the perplexity metric to evaluate the corrected reads for different parameter choices. Next, it finds the one that produces the highest alignment rate without using a reference genome. The fundamental intuition of our approach is that the perplexity metric is inversely correlated with the quality of the assembly after error correction. Therefore, Lerna leverages the perplexity metric for automated tuning of k-mer sizes without needing a reference genome. Results: First, we show that the best k-mer value can vary for different datasets, even for the same EC tool. This motivates our design that automates k-mer size selection without using a reference genome. Second, we show the gains of our LM using its component attention-based transformers. We show the model’s estimation of the perplexity metric before and after error correction. The lower the perplexity after correction, the better the k-mer size. We also show that the alignment rate and assembly quality computed for the corrected reads are strongly negatively correlated with the perplexity, enabling the automated selection of k-mer values for better error correction, and hence, improved assembly quality. We validate our approach on both short and long reads.
Additionally, we show that our attention-based models have significant runtime improvement for the entire pipeline (18× faster than previous works), due to parallelizing the attention mechanism and the use of JIT compilation for GPU inferencing. Conclusion: Lerna improves de novo genome assembly by optimizing EC tools. Our code is made available in a public repository at: https://github.com/icanforce/lerna-genomics." @default.
- W4206667478 created "2022-01-25" @default.
- W4206667478 creator A5001762730 @default.
- W4206667478 creator A5011795185 @default.
- W4206667478 creator A5055585728 @default.
- W4206667478 creator A5073589054 @default.
- W4206667478 creator A5074538846 @default.
- W4206667478 creator A5080069587 @default.
- W4206667478 date "2022-01-06" @default.
- W4206667478 modified "2023-10-08" @default.
- W4206667478 title "Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing" @default.
- W4206667478 cites W148400104 @default.
- W4206667478 cites W1579534339 @default.
- W4206667478 cites W1902237438 @default.
- W4206667478 cites W1903088142 @default.
- W4206667478 cites W1975570633 @default.
- W4206667478 cites W1980667059 @default.
- W4206667478 cites W2000050212 @default.
- W4206667478 cites W2001079689 @default.
- W4206667478 cites W2014099509 @default.
- W4206667478 cites W2063104570 @default.
- W4206667478 cites W2064675550 @default.
- W4206667478 cites W2077624120 @default.
- W4206667478 cites W2081763928 @default.
- W4206667478 cites W2082825869 @default.
- W4206667478 cites W2103078213 @default.
- W4206667478 cites W2104677379 @default.
- W4206667478 cites W2107772251 @default.
- W4206667478 cites W2113154626 @default.
- W4206667478 cites W2119745866 @default.
- W4206667478 cites W2121530737 @default.
- W4206667478 cites W2134800079 @default.
- W4206667478 cites W2141978199 @default.
- W4206667478 cites W2144993653 @default.
- W4206667478 cites W2160177274 @default.
- W4206667478 cites W2168908795 @default.
- W4206667478 cites W2170551349 @default.
- W4206667478 cites W2267186426 @default.
- W4206667478 cites W2293185259 @default.
- W4206667478 cites W2346241034 @default.
- W4206667478 cites W2413794162 @default.
- W4206667478 cites W2498287879 @default.
- W4206667478 cites W2515791790 @default.
- W4206667478 cites W2604585222 @default.
- W4206667478 cites W2604588349 @default.
- W4206667478 cites W2747175821 @default.
- W4206667478 cites W2774492845 @default.
- W4206667478 cites W2789843538 @default.
- W4206667478 cites W2904808784 @default.
- W4206667478 cites W2909240409 @default.
- W4206667478 cites W2921990535 @default.
- W4206667478 cites W2940391028 @default.
- W4206667478 cites W2950266744 @default.
- W4206667478 cites W2950354111 @default.
- W4206667478 cites W2962685467 @default.
- W4206667478 cites W2980672119 @default.
- W4206667478 cites W3007172120 @default.
- W4206667478 cites W3036135788 @default.
- W4206667478 cites W4211088835 @default.
- W4206667478 cites W4235832836 @default.
- W4206667478 cites W4245955616 @default.
- W4206667478 doi "https://doi.org/10.1186/s12859-021-04547-0" @default.
- W4206667478 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/34991450" @default.
- W4206667478 hasPublicationYear "2022" @default.
- W4206667478 type Work @default.
- W4206667478 citedByCount "3" @default.
- W4206667478 countsByYear W42066674782022 @default.
- W4206667478 countsByYear W42066674782023 @default.
- W4206667478 crossrefType "journal-article" @default.
- W4206667478 hasAuthorship W4206667478A5001762730 @default.
- W4206667478 hasAuthorship W4206667478A5011795185 @default.
- W4206667478 hasAuthorship W4206667478A5055585728 @default.
- W4206667478 hasAuthorship W4206667478A5073589054 @default.
- W4206667478 hasAuthorship W4206667478A5074538846 @default.
- W4206667478 hasAuthorship W4206667478A5080069587 @default.
- W4206667478 hasBestOaLocation W42066674781 @default.
- W4206667478 hasConcept C100279451 @default.
- W4206667478 hasConcept C103088060 @default.
- W4206667478 hasConcept C104317684 @default.
- W4206667478 hasConcept C11413529 @default.
- W4206667478 hasConcept C124101348 @default.
- W4206667478 hasConcept C137293760 @default.
- W4206667478 hasConcept C141231307 @default.
- W4206667478 hasConcept C154945302 @default.
- W4206667478 hasConcept C162324750 @default.
- W4206667478 hasConcept C176217482 @default.
- W4206667478 hasConcept C192953774 @default.
- W4206667478 hasConcept C21547014 @default.
- W4206667478 hasConcept C2279292 @default.
- W4206667478 hasConcept C40969351 @default.
- W4206667478 hasConcept C41008148 @default.
- W4206667478 hasConcept C54355233 @default.
- W4206667478 hasConcept C86803240 @default.
- W4206667478 hasConceptScore W4206667478C100279451 @default.
- W4206667478 hasConceptScore W4206667478C103088060 @default.
- W4206667478 hasConceptScore W4206667478C104317684 @default.
- W4206667478 hasConceptScore W4206667478C11413529 @default.
- W4206667478 hasConceptScore W4206667478C124101348 @default.
- W4206667478 hasConceptScore W4206667478C137293760 @default.
- W4206667478 hasConceptScore W4206667478C141231307 @default.