Matches in SemOpenAlex for { <https://semopenalex.org/work/W191503951> ?p ?o ?g. }
- W191503951 abstract "By leveraging enormous computational capabilities, scientists today are able to make significant progress on problems ranging from finding a cure for cancer to using fusion to solve the world's clean energy crisis. The number of computational components in extreme-scale computing environments is growing exponentially. Since the failure rate of each component factors in, the reliability of the overall system decreases proportionately. Hence, in spite of having enormous computational capabilities, these groundbreaking simulations may never run to completion. The only way to ensure their timely completion is to make these systems reliable, so that no failure can hinder the progress of science. On such systems, long-running scientific applications periodically store their execution states in checkpoint files on stable storage, and recover from a failure by restarting from the last saved checkpoint file. Resilient high-throughput and high-performance systems enable applications to simulate scientific problems at granularities finer than ever thought possible. Unfortunately, this explosion in scientific computing capabilities generates large amounts of state. As a result, today's checkpointing systems crumble under the increased amount of checkpoint data. Additionally, network I/O bandwidth is not growing nearly as fast as compute cycles. These two factors have caused scalability challenges for checkpointing systems. The focus of this thesis is to develop scalable checkpointing systems for two different execution environments: high-throughput grids and high-performance clusters. In a grid environment, machine owners voluntarily share their idle CPU cycles with other users of the system, as long as the performance degradation of host processes remains under a certain threshold. 
The challenge in such an environment is to ensure end-to-end application performance given the high rate of machine unavailability and of guest-job eviction. Today's systems often use expensive, high-performance dedicated checkpoint servers. In this thesis, we present a system, FALCON, which uses the available disk resources of the grid machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model the failures of storage hosts and predict the availability of checkpoint repositories. Experiments run on a production high-throughput system, DiaGrid, show that FALCON improves the overall performance of benchmark applications that write gigabytes of checkpoint data by between 11% and 44% compared to the widely used Condor checkpointing solutions. In high-performance computing (HPC) systems, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of a failure. We alleviate this problem by developing a scalable checkpoint-restart system, MCRENGINE. MCRENGINE aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression. 
We believe that the contributions made in this thesis serve as a good foundation for further research on improving the scalability of checkpointing systems in large-scale distributed computing environments." @default.
- W191503951 created "2016-06-24" @default.
- W191503951 creator A5002465410 @default.
- W191503951 creator A5047310442 @default.
- W191503951 date "2013-01-01" @default.
- W191503951 modified "2023-09-27" @default.
- W191503951 title "Reliable and scalable checkpointing systems for distributed computing environments" @default.
- W191503951 cites W1497470190 @default.
- W191503951 cites W1498674203 @default.
- W191503951 cites W1600328410 @default.
- W191503951 cites W1822772607 @default.
- W191503951 cites W1912206988 @default.
- W191503951 cites W1981432246 @default.
- W191503951 cites W1984564341 @default.
- W191503951 cites W1993383198 @default.
- W191503951 cites W2023324599 @default.
- W191503951 cites W2023779315 @default.
- W191503951 cites W2043522772 @default.
- W191503951 cites W2047459348 @default.
- W191503951 cites W2078837027 @default.
- W191503951 cites W2082498963 @default.
- W191503951 cites W2089536264 @default.
- W191503951 cites W2100970777 @default.
- W191503951 cites W2106811322 @default.
- W191503951 cites W2108334038 @default.
- W191503951 cites W2109485293 @default.
- W191503951 cites W2110455446 @default.
- W191503951 cites W2115890460 @default.
- W191503951 cites W2116011221 @default.
- W191503951 cites W2119018856 @default.
- W191503951 cites W2125269836 @default.
- W191503951 cites W2126716774 @default.
- W191503951 cites W2131613942 @default.
- W191503951 cites W2135202817 @default.
- W191503951 cites W2139244298 @default.
- W191503951 cites W2140746402 @default.
- W191503951 cites W2141299790 @default.
- W191503951 cites W2145594092 @default.
- W191503951 cites W2145778594 @default.
- W191503951 cites W2147667563 @default.
- W191503951 cites W2159161022 @default.
- W191503951 cites W2160225915 @default.
- W191503951 cites W2163295644 @default.
- W191503951 cites W2165022815 @default.
- W191503951 cites W2167563208 @default.
- W191503951 cites W2170163131 @default.
- W191503951 cites W39601649 @default.
- W191503951 cites W2250153769 @default.
- W191503951 hasPublicationYear "2013" @default.
- W191503951 type Work @default.
- W191503951 sameAs 191503951 @default.
- W191503951 citedByCount "0" @default.
- W191503951 crossrefType "journal-article" @default.
- W191503951 hasAuthorship W191503951A5002465410 @default.
- W191503951 hasAuthorship W191503951A5047310442 @default.
- W191503951 hasConcept C111919701 @default.
- W191503951 hasConcept C120314980 @default.
- W191503951 hasConcept C157764524 @default.
- W191503951 hasConcept C173608175 @default.
- W191503951 hasConcept C187691185 @default.
- W191503951 hasConcept C2524010 @default.
- W191503951 hasConcept C33923547 @default.
- W191503951 hasConcept C41008148 @default.
- W191503951 hasConcept C48044578 @default.
- W191503951 hasConcept C555944384 @default.
- W191503951 hasConcept C70429105 @default.
- W191503951 hasConcept C83283714 @default.
- W191503951 hasConceptScore W191503951C111919701 @default.
- W191503951 hasConceptScore W191503951C120314980 @default.
- W191503951 hasConceptScore W191503951C157764524 @default.
- W191503951 hasConceptScore W191503951C173608175 @default.
- W191503951 hasConceptScore W191503951C187691185 @default.
- W191503951 hasConceptScore W191503951C2524010 @default.
- W191503951 hasConceptScore W191503951C33923547 @default.
- W191503951 hasConceptScore W191503951C41008148 @default.
- W191503951 hasConceptScore W191503951C48044578 @default.
- W191503951 hasConceptScore W191503951C555944384 @default.
- W191503951 hasConceptScore W191503951C70429105 @default.
- W191503951 hasConceptScore W191503951C83283714 @default.
- W191503951 hasLocation W1915039511 @default.
- W191503951 hasOpenAccess W191503951 @default.
- W191503951 hasPrimaryLocation W1915039511 @default.
- W191503951 hasRelatedWork W1482451474 @default.
- W191503951 hasRelatedWork W1546275500 @default.
- W191503951 hasRelatedWork W1597348391 @default.
- W191503951 hasRelatedWork W1984564341 @default.
- W191503951 hasRelatedWork W2037736212 @default.
- W191503951 hasRelatedWork W2044920626 @default.
- W191503951 hasRelatedWork W2047196223 @default.
- W191503951 hasRelatedWork W2078349455 @default.
- W191503951 hasRelatedWork W2078837027 @default.
- W191503951 hasRelatedWork W2098983224 @default.
- W191503951 hasRelatedWork W2130768881 @default.
- W191503951 hasRelatedWork W2153958596 @default.
- W191503951 hasRelatedWork W2761821601 @default.
- W191503951 hasRelatedWork W2949822407 @default.
- W191503951 hasRelatedWork W2951565793 @default.
- W191503951 hasRelatedWork W2997170519 @default.
- W191503951 hasRelatedWork W3158095115 @default.
- W191503951 hasRelatedWork W3183818740 @default.