Matches in SemOpenAlex for { <https://semopenalex.org/work/W4320854356> ?p ?o ?g. }
Showing items 1 to 79 of 79 with 100 items per page.
- W4320854356 abstract "As the size of deep learning models gets larger and larger, training takes longer time and more resources, making fault tolerance more and more critical. Existing state-of-the-art methods like CheckFreq and Elastic Horovod need to back up a copy of the model state (i.e., parameters and optimizer states) in memory, which is costly for large models and leads to non-trivial overhead. This paper presents SWIFT, a novel recovery design for distributed deep neural network training that significantly reduces the failure recovery overhead without affecting training throughput and model accuracy. Instead of making an additional copy of the model state, SWIFT resolves the inconsistencies of the model state caused by the failure and exploits the replicas of the model state in data parallelism for failure recovery. We propose a logging-based approach when replicas are unavailable, which records intermediate data and replays the computation to recover the lost state upon a failure. The re-computation is distributed across multiple machines to accelerate failure recovery further. We also log intermediate data selectively, exploring the trade-off between recovery time and intermediate data storage overhead. Evaluations show that SWIFT significantly reduces the failure recovery time and achieves similar or better training throughput during failure-free execution compared to state-of-the-art methods without degrading final model accuracy. SWIFT can also achieve up to 1.16x speedup in total training time compared to state-of-the-art methods." @default.
- W4320854356 created "2023-02-16" @default.
- W4320854356 creator A5002676881 @default.
- W4320854356 creator A5004883324 @default.
- W4320854356 creator A5013466530 @default.
- W4320854356 creator A5020661868 @default.
- W4320854356 creator A5078076737 @default.
- W4320854356 date "2023-02-13" @default.
- W4320854356 modified "2023-09-28" @default.
- W4320854356 title "SWIFT: Expedited Failure Recovery for Large-scale DNN Training" @default.
- W4320854356 doi "https://doi.org/10.48550/arxiv.2302.06173" @default.
- W4320854356 hasPublicationYear "2023" @default.
- W4320854356 type Work @default.
- W4320854356 citedByCount "0" @default.
- W4320854356 crossrefType "posted-content" @default.
- W4320854356 hasAuthorship W4320854356A5002676881 @default.
- W4320854356 hasAuthorship W4320854356A5004883324 @default.
- W4320854356 hasAuthorship W4320854356A5013466530 @default.
- W4320854356 hasAuthorship W4320854356A5020661868 @default.
- W4320854356 hasAuthorship W4320854356A5078076737 @default.
- W4320854356 hasBestOaLocation W43208543561 @default.
- W4320854356 hasConcept C111919701 @default.
- W4320854356 hasConcept C11413529 @default.
- W4320854356 hasConcept C116188536 @default.
- W4320854356 hasConcept C120314980 @default.
- W4320854356 hasConcept C121332964 @default.
- W4320854356 hasConcept C153294291 @default.
- W4320854356 hasConcept C157764524 @default.
- W4320854356 hasConcept C165696696 @default.
- W4320854356 hasConcept C173608175 @default.
- W4320854356 hasConcept C199360897 @default.
- W4320854356 hasConcept C2777211547 @default.
- W4320854356 hasConcept C2778755073 @default.
- W4320854356 hasConcept C2779960059 @default.
- W4320854356 hasConcept C38652104 @default.
- W4320854356 hasConcept C41008148 @default.
- W4320854356 hasConcept C45374587 @default.
- W4320854356 hasConcept C48103436 @default.
- W4320854356 hasConcept C555944384 @default.
- W4320854356 hasConcept C62520636 @default.
- W4320854356 hasConcept C63540848 @default.
- W4320854356 hasConcept C68339613 @default.
- W4320854356 hasConceptScore W4320854356C111919701 @default.
- W4320854356 hasConceptScore W4320854356C11413529 @default.
- W4320854356 hasConceptScore W4320854356C116188536 @default.
- W4320854356 hasConceptScore W4320854356C120314980 @default.
- W4320854356 hasConceptScore W4320854356C121332964 @default.
- W4320854356 hasConceptScore W4320854356C153294291 @default.
- W4320854356 hasConceptScore W4320854356C157764524 @default.
- W4320854356 hasConceptScore W4320854356C165696696 @default.
- W4320854356 hasConceptScore W4320854356C173608175 @default.
- W4320854356 hasConceptScore W4320854356C199360897 @default.
- W4320854356 hasConceptScore W4320854356C2777211547 @default.
- W4320854356 hasConceptScore W4320854356C2778755073 @default.
- W4320854356 hasConceptScore W4320854356C2779960059 @default.
- W4320854356 hasConceptScore W4320854356C38652104 @default.
- W4320854356 hasConceptScore W4320854356C41008148 @default.
- W4320854356 hasConceptScore W4320854356C45374587 @default.
- W4320854356 hasConceptScore W4320854356C48103436 @default.
- W4320854356 hasConceptScore W4320854356C555944384 @default.
- W4320854356 hasConceptScore W4320854356C62520636 @default.
- W4320854356 hasConceptScore W4320854356C63540848 @default.
- W4320854356 hasConceptScore W4320854356C68339613 @default.
- W4320854356 hasLocation W43208543561 @default.
- W4320854356 hasOpenAccess W4320854356 @default.
- W4320854356 hasPrimaryLocation W43208543561 @default.
- W4320854356 hasRelatedWork W1509211761 @default.
- W4320854356 hasRelatedWork W156843270 @default.
- W4320854356 hasRelatedWork W1784146144 @default.
- W4320854356 hasRelatedWork W1905659066 @default.
- W4320854356 hasRelatedWork W1967627035 @default.
- W4320854356 hasRelatedWork W2007449167 @default.
- W4320854356 hasRelatedWork W2331290679 @default.
- W4320854356 hasRelatedWork W2391167130 @default.
- W4320854356 hasRelatedWork W34241620 @default.
- W4320854356 hasRelatedWork W4242263690 @default.
- W4320854356 isParatext "false" @default.
- W4320854356 isRetracted "false" @default.
- W4320854356 workType "article" @default.