Matches in SemOpenAlex for { <https://semopenalex.org/work/W48771824> ?p ?o ?g. }
- W48771824 abstract "In a large scale real-time distributed system, a large number of components and the time criticality of tasks can contribute to complex situations. Providing predictable and reliable service is a paramount interest in such a system. For example, a single point failure in an electric grid system may lead to a widespread power outage like the Northeast Blackout of 2003. System design and implementation address fault avoidance and mitigation. However, not all faults and failures can be removed during these phases, and therefore run-time fault avoidance and mitigation are needed during the operation. Timing constraints and predictability of the system behavior are important concerns in a large scale system as well. This dissertation proposes several distributed fault tolerance mechanisms using multi-agent technologies to predict and mitigate faults with various frequencies and severities. Some faults are frequently observed over time and some are not. In general, frequent fault types often cause relatively less severe consequences. Rare faults, however, are extremely difficult to predict, yet the consequences can be catastrophic. A rare fault—often indicated by repeated doses of common faults—causes severe harm. In our preliminary study, we design distributed rational agents using a probabilistic prediction mechanism to discover faults in the CMS experiments at CERN. All fault-mitigating activities of the agents and application tasks are guaranteed by the urgency-based priority scheduling policy with multiple steps of feasibility tests. The experiment shows that the distributed approach provides 15% more system availability than centralized approaches. This dissertation also explores the problem of predicting rare events. Many adaptive fault tolerant mechanisms attempt to predict faults through learning from data. 
However, in order to train the system, we need a significant amount of training data, which is not easily available for rare fault events. We use the PNNL (Pacific Northwest National Laboratory) system failure data collected from about 1,000 nodes over 4 years. We find that the severity of observed fault events is power-law distributed and there are certain associations among these events. Based on the power-law observation, we generate training data for the machine learning algorithm developed in this dissertation. The algorithm incorporates the power-law distribution principle, Bayesian inference, and logistic regression to predict rare events as well as common ones. The logistic regression is used to predict the probability of each type of events and the Bayesian inference is used for finding associations among events. A new learning algorithm is deployed with fully distributed agents using a rational decision model. The simulation study based on the PNNL data shows that the new prediction algorithm provides 15% better system availability than the prediction using the simple update method that was used in our preliminary study; and it achieves more than 10 times less system loss caused by rare faults. Finally, we developed a comprehensive simulation library, named SWARM-eTOSSIM for cyber-physical systems research. The library provides a framework suitable for simulating power-aware real-time distributed networked systems with powerful simulation controls and graphical interface. We downsized the new fault-mitigation mechanism so that it can be ported to devices with limited resources, such as sensor network elements." @default.
- W48771824 created "2016-06-24" @default.
- W48771824 creator A5028115505 @default.
- W48771824 creator A5041612441 @default.
- W48771824 date "2011-01-01" @default.
- W48771824 modified "2023-09-27" @default.
- W48771824 title "A distributed approach for fault mitigation in large scale distributed systems" @default.
- W48771824 hasPublicationYear "2011" @default.
- W48771824 type Work @default.
- W48771824 sameAs 48771824 @default.
- W48771824 citedByCount "0" @default.
- W48771824 crossrefType "journal-article" @default.
- W48771824 hasAuthorship W48771824A5028115505 @default.
- W48771824 hasAuthorship W48771824A5041612441 @default.
- W48771824 hasConcept C120314980 @default.
- W48771824 hasConcept C121332964 @default.
- W48771824 hasConcept C127313418 @default.
- W48771824 hasConcept C127413603 @default.
- W48771824 hasConcept C154945302 @default.
- W48771824 hasConcept C163258240 @default.
- W48771824 hasConcept C165205528 @default.
- W48771824 hasConcept C175551986 @default.
- W48771824 hasConcept C197640229 @default.
- W48771824 hasConcept C200601418 @default.
- W48771824 hasConcept C206729178 @default.
- W48771824 hasConcept C21547014 @default.
- W48771824 hasConcept C2777693866 @default.
- W48771824 hasConcept C41008148 @default.
- W48771824 hasConcept C49937458 @default.
- W48771824 hasConcept C62520636 @default.
- W48771824 hasConcept C63540848 @default.
- W48771824 hasConcept C89227174 @default.
- W48771824 hasConceptScore W48771824C120314980 @default.
- W48771824 hasConceptScore W48771824C121332964 @default.
- W48771824 hasConceptScore W48771824C127313418 @default.
- W48771824 hasConceptScore W48771824C127413603 @default.
- W48771824 hasConceptScore W48771824C154945302 @default.
- W48771824 hasConceptScore W48771824C163258240 @default.
- W48771824 hasConceptScore W48771824C165205528 @default.
- W48771824 hasConceptScore W48771824C175551986 @default.
- W48771824 hasConceptScore W48771824C197640229 @default.
- W48771824 hasConceptScore W48771824C200601418 @default.
- W48771824 hasConceptScore W48771824C206729178 @default.
- W48771824 hasConceptScore W48771824C21547014 @default.
- W48771824 hasConceptScore W48771824C2777693866 @default.
- W48771824 hasConceptScore W48771824C41008148 @default.
- W48771824 hasConceptScore W48771824C49937458 @default.
- W48771824 hasConceptScore W48771824C62520636 @default.
- W48771824 hasConceptScore W48771824C63540848 @default.
- W48771824 hasConceptScore W48771824C89227174 @default.
- W48771824 hasLocation W487718241 @default.
- W48771824 hasOpenAccess W48771824 @default.
- W48771824 hasPrimaryLocation W487718241 @default.
- W48771824 hasRelatedWork W103794388 @default.
- W48771824 hasRelatedWork W1530044334 @default.
- W48771824 hasRelatedWork W1560844163 @default.
- W48771824 hasRelatedWork W156133708 @default.
- W48771824 hasRelatedWork W1983444731 @default.
- W48771824 hasRelatedWork W2019759631 @default.
- W48771824 hasRelatedWork W2042967865 @default.
- W48771824 hasRelatedWork W2083930504 @default.
- W48771824 hasRelatedWork W2109014976 @default.
- W48771824 hasRelatedWork W2168296277 @default.
- W48771824 hasRelatedWork W2185304841 @default.
- W48771824 hasRelatedWork W2229576977 @default.
- W48771824 hasRelatedWork W2353300927 @default.
- W48771824 hasRelatedWork W2477663446 @default.
- W48771824 hasRelatedWork W2509865514 @default.
- W48771824 hasRelatedWork W2959457235 @default.
- W48771824 hasRelatedWork W3003075337 @default.
- W48771824 hasRelatedWork W3096930836 @default.
- W48771824 hasRelatedWork W3173203101 @default.
- W48771824 hasRelatedWork W62186976 @default.
- W48771824 isParatext "false" @default.
- W48771824 isRetracted "false" @default.
- W48771824 magId "48771824" @default.
- W48771824 workType "article" @default.