Matches in SemOpenAlex for { <https://semopenalex.org/work/W2894932706> ?p ?o ?g. }
- W2894932706 abstract "High Throughput Sequencing (HTS) technologies are constantly improving and making genome sequencing more affordable. However, HTS sequencers can only produce short overlapping genome fragments that are erroneous and cover the sequenced genomes unevenly. These genome fragments are assembled based on their overlaps to produce larger contiguous sequences. Since de novo genome assembly is computationally intensive, some species have a reference genome used as a guide for assembling genome fragments from the same species or as a basis for comparative genomics methods. Yet, assembling a genome is an error-prone process depending on the quality of the sequencing data and the heuristics used during the assembly. Furthermore, analyses based on a reference are biased towards the reference. Finally, a single reference cannot reflect the dynamics and diversity of a population of genomes. Overcoming these issues requires to move away from the single-genome reference-centric paradigm and take advantage of the multiple sequenced genomes available for each species. For this purpose, pan-genomes were introduced as sets of genomes from different strains of the same species. A pan-genome is represented by a multi-genome index exploiting the similarity and redundancy of the genomes it contains. Still, pan-genomes are more difficult to analyze than single genomes because of the large amount of data to be stored and indexed. Current data structures for pan-genome indexing do not fulfill all requirements for pan-genome analysis. Indeed, these data structures are often immutable while the size of a pan-genome grows constantly with newly sequenced genomes. Frequently, these data structures consider only assemblies as input, while unassembled genome fragments abound in databases. Also, indexing variants and similarities between the genomes of a pan-genome usually requires time and memory consuming algorithms such as sequence alignments. 
Sometimes, pan-genome analysis tools just assume variants and similarities are provided as input. While data structures already exist for pan-genome indexing, no solution is currently proposed for genome fragment compression in a pan-genome context. Indeed, it is often of interest to transmit and store all genome fragments of a pan-genome. However, HTS-specific compression tools are not dynamic and cannot update a compressed archive of genome fragments with new fragments of a genome without decompression. Hence, those tools are poorly adapted to the transmission and storage of genome fragments in a pan-genome context. In this thesis, we aim to provide scalable solutions for pan-genome indexing and storage. We first address the problem of pan-genome indexing by proposing a new alignment-free, reference-free and incremental data structure that considers genome fragments as well as assemblies in input: the Bloom Filter Trie (BFT). The BFT is a tree data structure representing a colored de Bruijn graph in which k-mers, words of length k from the input genomes, are associated with sets of colors representing the genomes in which they occur. The BFT makes extensive use of Bloom filters to navigate in the tree and optimize the graph traversal. A bursting method is employed to perform an efficient path and level compaction of the tree. We show that the BFT outperforms a data structure that has similar features but is based on an approximation of the set of indexed k-mers. Secondly, we address the problem of genome fragments compression in a pan-genome context by proposing a new abstract data structure, the guided de Bruijn graph. It augments the de Bruijn graph with k-mer partitions such that the graph traversal is guided to reconstruct exactly the genome fragments when decompressing. Different techniques are proposed to optimize the storage of fragments in the graph and the partition encoding. 
We show that the BFT described previously has all features required to index a guided de Bruijn graph and is used in the implementation of our compression method named DARRC. The evaluation of DARRC on a large pan-genome dataset compared to state-of-the-art HTS-specific and general purpose compression tools shows a 30% compression ratio improvement over the second best performing tool of this evaluation." @default.
- W2894932706 created "2018-10-12" @default.
- W2894932706 creator A5060291070 @default.
- W2894932706 date "2018-01-01" @default.
- W2894932706 modified "2023-09-27" @default.
- W2894932706 title "Pan-genome Search and Storage" @default.
- W2894932706 cites W1568325880 @default.
- W2894932706 cites W1625645377 @default.
- W2894932706 cites W1867886327 @default.
- W2894932706 cites W1931027898 @default.
- W2894932706 cites W1953219706 @default.
- W2894932706 cites W1964377951 @default.
- W2894932706 cites W1974033543 @default.
- W2894932706 cites W2002459863 @default.
- W2894932706 cites W2005129098 @default.
- W2894932706 cites W2008130168 @default.
- W2894932706 cites W2010361633 @default.
- W2894932706 cites W2011210839 @default.
- W2894932706 cites W2011993186 @default.
- W2894932706 cites W2015325511 @default.
- W2894932706 cites W2016995838 @default.
- W2894932706 cites W2031207772 @default.
- W2894932706 cites W2033740195 @default.
- W2894932706 cites W2034337154 @default.
- W2894932706 cites W2037219729 @default.
- W2894932706 cites W2041391522 @default.
- W2894932706 cites W2041824945 @default.
- W2894932706 cites W2055615325 @default.
- W2894932706 cites W2058759221 @default.
- W2894932706 cites W2060108852 @default.
- W2894932706 cites W2061680337 @default.
- W2894932706 cites W2064452752 @default.
- W2894932706 cites W2071225564 @default.
- W2894932706 cites W2072021854 @default.
- W2894932706 cites W2076747312 @default.
- W2894932706 cites W2087361130 @default.
- W2894932706 cites W2092880969 @default.
- W2894932706 cites W2093080729 @default.
- W2894932706 cites W2093610003 @default.
- W2894932706 cites W2096128575 @default.
- W2894932706 cites W2096465161 @default.
- W2894932706 cites W2100076391 @default.
- W2894932706 cites W2101247207 @default.
- W2894932706 cites W2102278945 @default.
- W2894932706 cites W2103441770 @default.
- W2894932706 cites W2104549677 @default.
- W2894932706 cites W2104846587 @default.
- W2894932706 cites W2107079154 @default.
- W2894932706 cites W2107745473 @default.
- W2894932706 cites W2108190694 @default.
- W2894932706 cites W2108640362 @default.
- W2894932706 cites W2111044311 @default.
- W2894932706 cites W2118442768 @default.
- W2894932706 cites W2118493204 @default.
- W2894932706 cites W2121252285 @default.
- W2894932706 cites W2123845384 @default.
- W2894932706 cites W2125119899 @default.
- W2894932706 cites W2125418992 @default.
- W2894932706 cites W2125456570 @default.
- W2894932706 cites W2125763460 @default.
- W2894932706 cites W2126353995 @default.
- W2894932706 cites W2126540423 @default.
- W2894932706 cites W2127230663 @default.
- W2894932706 cites W2127674396 @default.
- W2894932706 cites W2129264270 @default.
- W2894932706 cites W2129652681 @default.
- W2894932706 cites W2130530163 @default.
- W2894932706 cites W2133412023 @default.
- W2894932706 cites W2134283755 @default.
- W2894932706 cites W2135208303 @default.
- W2894932706 cites W2137351413 @default.
- W2894932706 cites W2138270253 @default.
- W2894932706 cites W2143420371 @default.
- W2894932706 cites W2144560237 @default.
- W2894932706 cites W2147477044 @default.
- W2894932706 cites W2150550043 @default.
- W2894932706 cites W2153707226 @default.
- W2894932706 cites W2154803978 @default.
- W2894932706 cites W2155512447 @default.
- W2894932706 cites W2156104322 @default.
- W2894932706 cites W2158322625 @default.
- W2894932706 cites W2158874082 @default.
- W2894932706 cites W2159906372 @default.
- W2894932706 cites W2161488606 @default.
- W2894932706 cites W2163338240 @default.
- W2894932706 cites W2166588423 @default.
- W2894932706 cites W2166666507 @default.
- W2894932706 cites W2167142288 @default.
- W2894932706 cites W2195724570 @default.
- W2894932706 cites W2198888083 @default.
- W2894932706 cites W2233676531 @default.
- W2894932706 cites W2235699210 @default.
- W2894932706 cites W2245756385 @default.
- W2894932706 cites W2247945216 @default.
- W2894932706 cites W2266239166 @default.
- W2894932706 cites W2337480916 @default.
- W2894932706 cites W2339406693 @default.
- W2894932706 cites W2438121987 @default.
- W2894932706 cites W2531091319 @default.
- W2894932706 cites W2533248932 @default.