Matches in SemOpenAlex for { <https://semopenalex.org/work/W4226053458> ?p ?o ?g. }
- W4226053458 endingPage "285" @default.
- W4226053458 startingPage "251" @default.
- W4226053458 abstract "This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically." @default.
- W4226053458 created "2022-05-05" @default.
- W4226053458 creator A5004359265 @default.
- W4226053458 creator A5012133869 @default.
- W4226053458 creator A5012806004 @default.
- W4226053458 creator A5013543252 @default.
- W4226053458 creator A5017292778 @default.
- W4226053458 creator A5018841476 @default.
- W4226053458 creator A5021065276 @default.
- W4226053458 creator A5025780942 @default.
- W4226053458 creator A5026096803 @default.
- W4226053458 creator A5028391522 @default.
- W4226053458 creator A5032938194 @default.
- W4226053458 creator A5034102439 @default.
- W4226053458 creator A5037989453 @default.
- W4226053458 creator A5045289712 @default.
- W4226053458 creator A5047821416 @default.
- W4226053458 creator A5050012491 @default.
- W4226053458 creator A5055373392 @default.
- W4226053458 creator A5056569157 @default.
- W4226053458 creator A5057650049 @default.
- W4226053458 creator A5060047917 @default.
- W4226053458 creator A5064382259 @default.
- W4226053458 creator A5068403289 @default.
- W4226053458 creator A5069177485 @default.
- W4226053458 creator A5073056338 @default.
- W4226053458 creator A5074132178 @default.
- W4226053458 creator A5074346385 @default.
- W4226053458 creator A5074411785 @default.
- W4226053458 creator A5077614531 @default.
- W4226053458 creator A5077666857 @default.
- W4226053458 creator A5077768314 @default.
- W4226053458 creator A5079730045 @default.
- W4226053458 creator A5081408798 @default.
- W4226053458 creator A5083030982 @default.
- W4226053458 creator A5085094235 @default.
- W4226053458 creator A5091149742 @default.
- W4226053458 creator A5019745966 @default.
- W4226053458 date "2021-12-10" @default.
- W4226053458 modified "2023-10-02" @default.
- W4226053458 title "Resiliency in numerical algorithm design for extreme scale simulations" @default.
- W4226053458 cites W1476286422 @default.
- W4226053458 cites W1499850000 @default.
- W4226053458 cites W1541239844 @default.
- W4226053458 cites W1557253686 @default.
- W4226053458 cites W1620307392 @default.
- W4226053458 cites W1646309082 @default.
- W4226053458 cites W1765408465 @default.
- W4226053458 cites W179730351 @default.
- W4226053458 cites W1965069357 @default.
- W4226053458 cites W1970476229 @default.
- W4226053458 cites W1977646937 @default.
- W4226053458 cites W1978564754 @default.
- W4226053458 cites W1981432246 @default.
- W4226053458 cites W1984564341 @default.
- W4226053458 cites W1984848758 @default.
- W4226053458 cites W1985713815 @default.
- W4226053458 cites W1987493287 @default.
- W4226053458 cites W1989331073 @default.
- W4226053458 cites W1990615614 @default.
- W4226053458 cites W1995746640 @default.
- W4226053458 cites W1997126580 @default.
- W4226053458 cites W2005390260 @default.
- W4226053458 cites W2005998283 @default.
- W4226053458 cites W2007024473 @default.
- W4226053458 cites W2008505510 @default.
- W4226053458 cites W2011100413 @default.
- W4226053458 cites W2013641642 @default.
- W4226053458 cites W2017060126 @default.
- W4226053458 cites W2019228767 @default.
- W4226053458 cites W2020213615 @default.
- W4226053458 cites W2021234574 @default.
- W4226053458 cites W2031149877 @default.
- W4226053458 cites W2031260715 @default.
- W4226053458 cites W2034593585 @default.
- W4226053458 cites W2035448730 @default.
- W4226053458 cites W2035492130 @default.
- W4226053458 cites W2035851348 @default.
- W4226053458 cites W2036641664 @default.
- W4226053458 cites W2037208432 @default.
- W4226053458 cites W2038238534 @default.
- W4226053458 cites W2041101939 @default.
- W4226053458 cites W2043184000 @default.
- W4226053458 cites W2044752722 @default.
- W4226053458 cites W2045152180 @default.
- W4226053458 cites W2046607737 @default.
- W4226053458 cites W2051757414 @default.
- W4226053458 cites W2055274745 @default.
- W4226053458 cites W2059701835 @default.
- W4226053458 cites W2066927514 @default.
- W4226053458 cites W2072072075 @default.
- W4226053458 cites W2073362170 @default.
- W4226053458 cites W2074626480 @default.
- W4226053458 cites W2078794610 @default.
- W4226053458 cites W2079577430 @default.
- W4226053458 cites W2080571175 @default.
- W4226053458 cites W2083613288 @default.
- W4226053458 cites W2086247510 @default.