Matches in SemOpenAlex for { <https://semopenalex.org/work/W3131881330> ?p ?o ?g. }
- W3131881330 abstract "Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead." @default.
- W3131881330 created "2021-03-01" @default.
- W3131881330 creator A5016778130 @default.
- W3131881330 creator A5027605695 @default.
- W3131881330 creator A5030058653 @default.
- W3131881330 creator A5030574828 @default.
- W3131881330 creator A5032073929 @default.
- W3131881330 creator A5043860236 @default.
- W3131881330 creator A5062104583 @default.
- W3131881330 creator A5067802693 @default.
- W3131881330 date "2020-11-01" @default.
- W3131881330 modified "2023-09-24" @default.
- W3131881330 title "Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems" @default.
- W3131881330 cites W1553890549 @default.
- W3131881330 cites W1661413208 @default.
- W3131881330 cites W1840725281 @default.
- W3131881330 cites W1974388055 @default.
- W3131881330 cites W1975705975 @default.
- W3131881330 cites W1990249073 @default.
- W3131881330 cites W1994815284 @default.
- W3131881330 cites W2010805714 @default.
- W3131881330 cites W2021171338 @default.
- W3131881330 cites W2025024269 @default.
- W3131881330 cites W2029921999 @default.
- W3131881330 cites W2037721173 @default.
- W3131881330 cites W2038924755 @default.
- W3131881330 cites W2039157918 @default.
- W3131881330 cites W2040512983 @default.
- W3131881330 cites W2049459394 @default.
- W3131881330 cites W2107502666 @default.
- W3131881330 cites W2127577941 @default.
- W3131881330 cites W2133943294 @default.
- W3131881330 cites W2142812297 @default.
- W3131881330 cites W2143143555 @default.
- W3131881330 cites W2143522309 @default.
- W3131881330 cites W2145071552 @default.
- W3131881330 cites W2154884120 @default.
- W3131881330 cites W2157736543 @default.
- W3131881330 cites W2168118398 @default.
- W3131881330 cites W2217402295 @default.
- W3131881330 cites W2258641260 @default.
- W3131881330 cites W2613580061 @default.
- W3131881330 cites W2757367413 @default.
- W3131881330 cites W2767094836 @default.
- W3131881330 cites W2769038759 @default.
- W3131881330 cites W2895690683 @default.
- W3131881330 cites W2902431547 @default.
- W3131881330 cites W2915854813 @default.
- W3131881330 cites W2984468687 @default.
- W3131881330 cites W3080528907 @default.
- W3131881330 cites W3138819813 @default.
- W3131881330 cites W4251820394 @default.
- W3131881330 cites W4253233651 @default.
- W3131881330 cites W4254182148 @default.
- W3131881330 doi "https://doi.org/10.1109/sc41405.2020.00069" @default.
- W3131881330 hasPublicationYear "2020" @default.
- W3131881330 type Work @default.
- W3131881330 sameAs 3131881330 @default.
- W3131881330 citedByCount "4" @default.
- W3131881330 countsByYear W31318813302020 @default.
- W3131881330 countsByYear W31318813302021 @default.
- W3131881330 countsByYear W31318813302022 @default.
- W3131881330 crossrefType "proceedings-article" @default.
- W3131881330 hasAuthorship W3131881330A5016778130 @default.
- W3131881330 hasAuthorship W3131881330A5027605695 @default.
- W3131881330 hasAuthorship W3131881330A5030058653 @default.
- W3131881330 hasAuthorship W3131881330A5030574828 @default.
- W3131881330 hasAuthorship W3131881330A5032073929 @default.
- W3131881330 hasAuthorship W3131881330A5043860236 @default.
- W3131881330 hasAuthorship W3131881330A5062104583 @default.
- W3131881330 hasAuthorship W3131881330A5067802693 @default.
- W3131881330 hasConcept C111919701 @default.
- W3131881330 hasConcept C120314980 @default.
- W3131881330 hasConcept C121332964 @default.
- W3131881330 hasConcept C163258240 @default.
- W3131881330 hasConcept C165136773 @default.
- W3131881330 hasConcept C183469790 @default.
- W3131881330 hasConcept C199360897 @default.
- W3131881330 hasConcept C199683683 @default.
- W3131881330 hasConcept C2778037017 @default.
- W3131881330 hasConcept C2779960059 @default.
- W3131881330 hasConcept C41008148 @default.
- W3131881330 hasConcept C43214815 @default.
- W3131881330 hasConcept C62520636 @default.
- W3131881330 hasConcept C63540848 @default.
- W3131881330 hasConcept C79403827 @default.
- W3131881330 hasConcept C83283714 @default.
- W3131881330 hasConceptScore W3131881330C111919701 @default.
- W3131881330 hasConceptScore W3131881330C120314980 @default.
- W3131881330 hasConceptScore W3131881330C121332964 @default.
- W3131881330 hasConceptScore W3131881330C163258240 @default.
- W3131881330 hasConceptScore W3131881330C165136773 @default.
- W3131881330 hasConceptScore W3131881330C183469790 @default.
- W3131881330 hasConceptScore W3131881330C199360897 @default.
- W3131881330 hasConceptScore W3131881330C199683683 @default.
- W3131881330 hasConceptScore W3131881330C2778037017 @default.
- W3131881330 hasConceptScore W3131881330C2779960059 @default.
- W3131881330 hasConceptScore W3131881330C41008148 @default.
- W3131881330 hasConceptScore W3131881330C43214815 @default.
- W3131881330 hasConceptScore W3131881330C62520636 @default.