Matches in SemOpenAlex for { <https://semopenalex.org/work/W4376505605> ?p ?o ?g. }
Showing items 1 to 92 of 92 with 100 items per page.
- W4376505605 endingPage "26" @default.
- W4376505605 startingPage "1" @default.
- W4376505605 abstract "Deep learning (DL) has become a key component of modern software. In the “big model” era, the rich features of DL-based software (i.e., DL software) substantially rely on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained on the powerful cloud with large datasets. Hence, training effective DL models has become a vital stage in the whole software lifecycle. When training deep learning models, especially those big models, developers need to parallelize and distribute the computation and memory resources amongst multiple devices (e.g., a cluster of GPUs) in the training process, which is known as distributed deep learning training, or distributed training for short. However, the unique challenges that developers encounter in distributed training process have not been studied in the software engineering community. Given the increasingly heavy dependence of current DL-based software on distributed training, this paper aims to fill in the knowledge gap and presents the first comprehensive study on developers’ issues in distributed training. To this end, we focus on popular DL frameworks that support distributed training (including TensorFlow, PyTorch, Keras, and Horovod) and analyze 1,131 real-world developers’ issues about using these frameworks reported on Stack Overflow and GitHub. We construct a fine-grained taxonomy consisting of 30 categories regarding the fault symptoms and summarize common fix patterns for different symptoms. We find that: (1) many distributed-specific faults and non-distributed-specific faults inherently share the same fault symptoms, making it challenging to debug; (2) most of the fault symptoms have frequent fix patterns; (3) about half of the faults are related to system-level configurations. Based on the results, we suggest actionable implications on research avenues that can potentially facilitate the distributed training to develop DL-based software, such as focusing on the frequent and common fix patterns when designing testing or debugging tools, developing efficient testing and debugging techniques for communication configuration along with the synthesis of network configuration analysis, designing new multi-device checkpoint-and-replay techniques to help reproduction, and designing serverless APIs for cloud platforms." @default.
- W4376505605 created "2023-05-15" @default.
- W4376505605 creator A5001824731 @default.
- W4376505605 creator A5031457464 @default.
- W4376505605 creator A5031799406 @default.
- W4376505605 creator A5038736328 @default.
- W4376505605 creator A5052249316 @default.
- W4376505605 creator A5067390667 @default.
- W4376505605 creator A5070648432 @default.
- W4376505605 creator A5088948176 @default.
- W4376505605 date "2023-09-29" @default.
- W4376505605 modified "2023-10-14" @default.
- W4376505605 title "Rise of Distributed Deep Learning Training in the Big Model Era: From A Software Engineering Perspective" @default.
- W4376505605 cites W2039676055 @default.
- W4376505605 cites W2053154970 @default.
- W4376505605 cites W2104577574 @default.
- W4376505605 cites W2113175552 @default.
- W4376505605 cites W2149490731 @default.
- W4376505605 cites W2164777277 @default.
- W4376505605 cites W2968594320 @default.
- W4376505605 cites W2975712713 @default.
- W4376505605 cites W2981937105 @default.
- W4376505605 cites W3138798301 @default.
- W4376505605 cites W3206636350 @default.
- W4376505605 cites W4233671654 @default.
- W4376505605 cites W4290991168 @default.
- W4376505605 doi "https://doi.org/10.1145/3597204" @default.
- W4376505605 hasPublicationYear "2023" @default.
- W4376505605 type Work @default.
- W4376505605 citedByCount "0" @default.
- W4376505605 crossrefType "journal-article" @default.
- W4376505605 hasAuthorship W4376505605A5001824731 @default.
- W4376505605 hasAuthorship W4376505605A5031457464 @default.
- W4376505605 hasAuthorship W4376505605A5031799406 @default.
- W4376505605 hasAuthorship W4376505605A5038736328 @default.
- W4376505605 hasAuthorship W4376505605A5052249316 @default.
- W4376505605 hasAuthorship W4376505605A5067390667 @default.
- W4376505605 hasAuthorship W4376505605A5070648432 @default.
- W4376505605 hasAuthorship W4376505605A5088948176 @default.
- W4376505605 hasBestOaLocation W43765056051 @default.
- W4376505605 hasConcept C108583219 @default.
- W4376505605 hasConcept C111919701 @default.
- W4376505605 hasConcept C115903868 @default.
- W4376505605 hasConcept C119857082 @default.
- W4376505605 hasConcept C120314980 @default.
- W4376505605 hasConcept C154945302 @default.
- W4376505605 hasConcept C168065819 @default.
- W4376505605 hasConcept C2522767166 @default.
- W4376505605 hasConcept C2777904410 @default.
- W4376505605 hasConcept C41008148 @default.
- W4376505605 hasConcept C50712370 @default.
- W4376505605 hasConcept C63540848 @default.
- W4376505605 hasConcept C75684735 @default.
- W4376505605 hasConcept C79974875 @default.
- W4376505605 hasConceptScore W4376505605C108583219 @default.
- W4376505605 hasConceptScore W4376505605C111919701 @default.
- W4376505605 hasConceptScore W4376505605C115903868 @default.
- W4376505605 hasConceptScore W4376505605C119857082 @default.
- W4376505605 hasConceptScore W4376505605C120314980 @default.
- W4376505605 hasConceptScore W4376505605C154945302 @default.
- W4376505605 hasConceptScore W4376505605C168065819 @default.
- W4376505605 hasConceptScore W4376505605C2522767166 @default.
- W4376505605 hasConceptScore W4376505605C2777904410 @default.
- W4376505605 hasConceptScore W4376505605C41008148 @default.
- W4376505605 hasConceptScore W4376505605C50712370 @default.
- W4376505605 hasConceptScore W4376505605C63540848 @default.
- W4376505605 hasConceptScore W4376505605C75684735 @default.
- W4376505605 hasConceptScore W4376505605C79974875 @default.
- W4376505605 hasFunder F4320321001 @default.
- W4376505605 hasIssue "6" @default.
- W4376505605 hasLocation W43765056051 @default.
- W4376505605 hasLocation W43765056052 @default.
- W4376505605 hasLocation W43765056053 @default.
- W4376505605 hasLocation W43765056054 @default.
- W4376505605 hasOpenAccess W4376505605 @default.
- W4376505605 hasPrimaryLocation W43765056051 @default.
- W4376505605 hasRelatedWork W1561183558 @default.
- W4376505605 hasRelatedWork W2113737616 @default.
- W4376505605 hasRelatedWork W2165816089 @default.
- W4376505605 hasRelatedWork W2508503355 @default.
- W4376505605 hasRelatedWork W2798934447 @default.
- W4376505605 hasRelatedWork W3014300295 @default.
- W4376505605 hasRelatedWork W4200184607 @default.
- W4376505605 hasRelatedWork W4206291213 @default.
- W4376505605 hasRelatedWork W4285160398 @default.
- W4376505605 hasRelatedWork W4385485544 @default.
- W4376505605 hasVolume "32" @default.
- W4376505605 isParatext "false" @default.
- W4376505605 isRetracted "false" @default.
- W4376505605 workType "article" @default.