Matches in SemOpenAlex for { <https://semopenalex.org/work/W141129880> ?p ?o ?g. }
- W141129880 abstract "Scientists use advanced computing techniques to assist in answering the complex questions at the forefront of discovery. The High Performance Computing (HPC) scientific applications created by these scientists are running longer and scaling to larger systems. These applications must be able to tolerate the inevitable failure of a subset of processes (process failures) that occur as a result of pushing the reliability boundaries of HPC systems. HPC system reliability is emerging as a problem in future exascale systems where the time to failure is measured in minutes or hours instead of days or months. Resilient applications (i.e., applications that can continue to run despite process failures) depend on resilient communication and runtime environments to sustain the application across process failures. Unfortunately, these environments are uncommon and not typically present on HPC systems. In order to preserve performance, scalability, and scientific accuracy, a resilient application may choose the invasiveness of the recovery solution, from completely transparent to completely application-directed. Therefore, resilient communication and runtime environments must provide customizable fault recovery mechanisms. Resilient applications often use rollback recovery techniques for fault tolerance: particularly popular are checkpoint/restart (C/R) techniques. HPC applications commonly use the Message Passing Interface (MPI) standard for communication. This thesis identifies a complete set of capabilities that compose to form a coordinated C/R infrastructure for MPI applications running on HPC systems. These capabilities, when integrated into an MPI implementation, provide applications with transparent, yet optionally application configurable, fault tolerance. By adding these capabilities to Open MPI we demonstrate support for C/R process fault tolerance, automatic recovery, proactive process migration, and parallel debugging. We also discuss how this infrastructure is being used to support further research into fault tolerance." @default.
- W141129880 created "2016-06-24" @default.
- W141129880 creator A5000867808 @default.
- W141129880 creator A5076123370 @default.
- W141129880 date "2010-01-01" @default.
- W141129880 modified "2023-10-01" @default.
- W141129880 title "Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems" @default.
- W141129880 cites W126554001 @default.
- W141129880 cites W128180137 @default.
- W141129880 cites W134816542 @default.
- W141129880 cites W1482022028 @default.
- W141129880 cites W1494600391 @default.
- W141129880 cites W1494933502 @default.
- W141129880 cites W1498586823 @default.
- W141129880 cites W1499133891 @default.
- W141129880 cites W1500546894 @default.
- W141129880 cites W1504603681 @default.
- W141129880 cites W1510894298 @default.
- W141129880 cites W1515322768 @default.
- W141129880 cites W1516416232 @default.
- W141129880 cites W1520339130 @default.
- W141129880 cites W1524357507 @default.
- W141129880 cites W1525865893 @default.
- W141129880 cites W1527076549 @default.
- W141129880 cites W1527754331 @default.
- W141129880 cites W153168184 @default.
- W141129880 cites W1532689837 @default.
- W141129880 cites W1537929875 @default.
- W141129880 cites W1545979834 @default.
- W141129880 cites W1550132458 @default.
- W141129880 cites W1554413585 @default.
- W141129880 cites W1554637430 @default.
- W141129880 cites W1555640165 @default.
- W141129880 cites W1560249771 @default.
- W141129880 cites W1568637577 @default.
- W141129880 cites W1571582830 @default.
- W141129880 cites W1574290619 @default.
- W141129880 cites W1579345337 @default.
- W141129880 cites W1582505175 @default.
- W141129880 cites W1586189629 @default.
- W141129880 cites W1591191547 @default.
- W141129880 cites W1597130819 @default.
- W141129880 cites W161866744 @default.
- W141129880 cites W1643342729 @default.
- W141129880 cites W169659540 @default.
- W141129880 cites W1730721998 @default.
- W141129880 cites W173313170 @default.
- W141129880 cites W1767718504 @default.
- W141129880 cites W1780860177 @default.
- W141129880 cites W1801512066 @default.
- W141129880 cites W1815088304 @default.
- W141129880 cites W1825216778 @default.
- W141129880 cites W1826478303 @default.
- W141129880 cites W1846255488 @default.
- W141129880 cites W1847519410 @default.
- W141129880 cites W1848988118 @default.
- W141129880 cites W1855495706 @default.
- W141129880 cites W1861256213 @default.
- W141129880 cites W1862835629 @default.
- W141129880 cites W1877496576 @default.
- W141129880 cites W1911403303 @default.
- W141129880 cites W1913279262 @default.
- W141129880 cites W1923741182 @default.
- W141129880 cites W1931589823 @default.
- W141129880 cites W1935389564 @default.
- W141129880 cites W1954990415 @default.
- W141129880 cites W1963836890 @default.
- W141129880 cites W1964729314 @default.
- W141129880 cites W1964998488 @default.
- W141129880 cites W1965091139 @default.
- W141129880 cites W1968105297 @default.
- W141129880 cites W1969550081 @default.
- W141129880 cites W1969617086 @default.
- W141129880 cites W1973269641 @default.
- W141129880 cites W1979117305 @default.
- W141129880 cites W1983037032 @default.
- W141129880 cites W1985006281 @default.
- W141129880 cites W1985164784 @default.
- W141129880 cites W1986009243 @default.
- W141129880 cites W1987608073 @default.
- W141129880 cites W1989678938 @default.
- W141129880 cites W1990580673 @default.
- W141129880 cites W1993383198 @default.
- W141129880 cites W1995644038 @default.
- W141129880 cites W1997021720 @default.
- W141129880 cites W2002667367 @default.
- W141129880 cites W2003214215 @default.
- W141129880 cites W2004885674 @default.
- W141129880 cites W2005728319 @default.
- W141129880 cites W2007397415 @default.
- W141129880 cites W2014010055 @default.
- W141129880 cites W2014594876 @default.
- W141129880 cites W2019465613 @default.
- W141129880 cites W2019680442 @default.
- W141129880 cites W2027888627 @default.
- W141129880 cites W2032094884 @default.
- W141129880 cites W2032237458 @default.
- W141129880 cites W2032493687 @default.
- W141129880 cites W2032858539 @default.
- W141129880 cites W2033656974 @default.