Matches in SemOpenAlex for { <https://semopenalex.org/work/W2911327665> ?p ?o ?g. }
- W2911327665 abstract "Aggressive scaling of CMOS transistors has enabled extensive system integration and building faster and more efficient systems. On the flip side, this has resulted in an increasing number of devices that fail in shipped components in-the-field for a variety of reasons including soft errors, wear-out failures, and infant mortality. The pervasiveness of the problem across a broad market demands low cost and generic reliability solutions, precluding traditional solutions that employed excessive redundancy or piecemeal solutions that address only a few failure modes. This dissertation presents SWAT (SoftWare Anomaly Treatment), a low cost resiliency solution that effectively handles hardware faults while incurring low cost during the common mode of fault-free operations. SWAT is based on two key observations about the design of resilient systems. First, only those hardware faults that affect software need to be handled and second, since the common mode of operation is fault-free, fault-free execution should incur near-zero overheads. SWAT thus uses novel zero to low cost hardware and software monitors that watch for anomalous software behavior to detect hardware faults. SWAT then relies on hardware support for checkpointing and rollback recovery. When dealing with fault recovery in the presence of I/O, we identify that existing software-level mechanisms that handle output buffering fall short. This dissertation therefore proposes a simple low-cost hardware buffer for output buffering and demonstrates that this strategy achieves high recoverability while incurring low overheads. Although not detailed in this dissertation, SWAT contains a comprehensive diagnosis procedure that is invoked in the rare event of a fault to isolate the root-cause of the fault by distinguishing between software bugs, transient hardware faults, and permanent hardware faults. 
Effectively, SWAT handles hardware faults uniformly as software bugs, amortizing the resiliency cost across both hardware and software reliability. The results in this dissertation show that the SWAT strategy is effective to detect and recover the system from a variety of in-core permanent and transient faults in various microarchitecture units for both compute-intensive and I/O-intensive workloads. In particular, this dissertation demonstrates that the SWAT detectors detect nearly all permanent and transient faults in most hardware units in both types of workloads, with only a small fraction of the faults corrupting application output. (Certain hardware structures like the FPU may need additional support to be amenable to software anomaly detection.) Further, a majority of these faults are tolerated by the applications due to their inherent fault-tolerant nature, resulting in only 0.2% of the injected faults affecting the application and yielding incorrect outputs (such faults are classified as Silent Data Corruptions, or SDCs). When attempting to recover the detected faults, we show that handling I/O is important for fault recovery. With our proposed low-cost hardware for output buffering, we show that over 94% of the detected faults are recoverable with low performance and area overheads during fault-free execution even in the presence of I/O. Finally, this dissertation builds a fundamental understanding behind why the SWAT strategy is effective for handling faults in modern workloads. The key insight is that the SWAT detectors are adept at detecting perturbations in control operations and memory addresses and a majority of the application values affect such operations. Faults in values that never affect such operations are hard-to-detect and require additional support to be amenable to software anomaly detection. In summary, this dissertation presents SWAT as a complete solution to detect and recover from in-core hardware faults.
The techniques presented here therefore have far reaching implications on the design of low-cost solutions to handle unreliable hardware." @default.
- W2911327665 created "2019-02-21" @default.
- W2911327665 creator A5003311789 @default.
- W2911327665 creator A5086111967 @default.
- W2911327665 date "2011-01-01" @default.
- W2911327665 modified "2023-09-23" @default.
- W2911327665 title "Detecting and recovering from in-core hardware faults through software anomaly treatment" @default.
- W2911327665 cites W1125513871 @default.
- W2911327665 cites W129814695 @default.
- W2911327665 cites W1482451474 @default.
- W2911327665 cites W1488205854 @default.
- W2911327665 cites W1523125571 @default.
- W2911327665 cites W1533195354 @default.
- W2911327665 cites W1560055425 @default.
- W2911327665 cites W1579215414 @default.
- W2911327665 cites W1583776219 @default.
- W2911327665 cites W1891950198 @default.
- W2911327665 cites W1971952282 @default.
- W2911327665 cites W1994759706 @default.
- W2911327665 cites W2001354277 @default.
- W2911327665 cites W2007325303 @default.
- W2911327665 cites W2007925061 @default.
- W2911327665 cites W2008482633 @default.
- W2911327665 cites W2019463941 @default.
- W2911327665 cites W2020888328 @default.
- W2911327665 cites W2038366891 @default.
- W2911327665 cites W2051694501 @default.
- W2911327665 cites W2088250010 @default.
- W2911327665 cites W2098473740 @default.
- W2911327665 cites W2099123934 @default.
- W2911327665 cites W2099828501 @default.
- W2911327665 cites W2100866260 @default.
- W2911327665 cites W2101580666 @default.
- W2911327665 cites W2104677471 @default.
- W2911327665 cites W2104915333 @default.
- W2911327665 cites W2105372251 @default.
- W2911327665 cites W2108557605 @default.
- W2911327665 cites W2110908283 @default.
- W2911327665 cites W2112648765 @default.
- W2911327665 cites W2112752650 @default.
- W2911327665 cites W2114100940 @default.
- W2911327665 cites W2114498748 @default.
- W2911327665 cites W2115081151 @default.
- W2911327665 cites W2116613705 @default.
- W2911327665 cites W2116991991 @default.
- W2911327665 cites W2118033476 @default.
- W2911327665 cites W2118811116 @default.
- W2911327665 cites W2119160628 @default.
- W2911327665 cites W2121579803 @default.
- W2911327665 cites W2123475473 @default.
- W2911327665 cites W2123608497 @default.
- W2911327665 cites W2123907700 @default.
- W2911327665 cites W2125169487 @default.
- W2911327665 cites W2126869140 @default.
- W2911327665 cites W2128941141 @default.
- W2911327665 cites W2129360963 @default.
- W2911327665 cites W2129673456 @default.
- W2911327665 cites W2139727248 @default.
- W2911327665 cites W2141365240 @default.
- W2911327665 cites W2142892618 @default.
- W2911327665 cites W2143242007 @default.
- W2911327665 cites W2144382742 @default.
- W2911327665 cites W2144495364 @default.
- W2911327665 cites W2146065717 @default.
- W2911327665 cites W2147280888 @default.
- W2911327665 cites W2148109481 @default.
- W2911327665 cites W2148162182 @default.
- W2911327665 cites W2149473197 @default.
- W2911327665 cites W2151845324 @default.
- W2911327665 cites W2152475836 @default.
- W2911327665 cites W2155581886 @default.
- W2911327665 cites W2155851497 @default.
- W2911327665 cites W2156204788 @default.
- W2911327665 cites W2159889776 @default.
- W2911327665 cites W2162351670 @default.
- W2911327665 cites W2163890539 @default.
- W2911327665 cites W2164264749 @default.
- W2911327665 cites W2165022815 @default.
- W2911327665 cites W2171882483 @default.
- W2911327665 cites W2174598112 @default.
- W2911327665 cites W2342091124 @default.
- W2911327665 cites W3203992401 @default.
- W2911327665 cites W589836237 @default.
- W2911327665 hasPublicationYear "2011" @default.
- W2911327665 type Work @default.
- W2911327665 sameAs 2911327665 @default.
- W2911327665 citedByCount "2" @default.
- W2911327665 countsByYear W29113276652013 @default.
- W2911327665 crossrefType "journal-article" @default.
- W2911327665 hasAuthorship W2911327665A5003311789 @default.
- W2911327665 hasAuthorship W2911327665A5086111967 @default.
- W2911327665 hasConcept C111919701 @default.
- W2911327665 hasConcept C119599485 @default.
- W2911327665 hasConcept C126953365 @default.
- W2911327665 hasConcept C127413603 @default.
- W2911327665 hasConcept C134146338 @default.
- W2911327665 hasConcept C149635348 @default.
- W2911327665 hasConcept C152124472 @default.
- W2911327665 hasConcept C200601418 @default.
- W2911327665 hasConcept C2777904410 @default.