Matches in SemOpenAlex for { <https://semopenalex.org/work/W2476272165> ?p ?o ?g. }
Showing items 1 to 67 of 67 with 100 items per page.
- W2476272165 abstract "In parallel computing, MPI is heavily used due to its support of popular cluster based parallel machines and the Single Program Multiple Data (SPMD) model. Normally cluster nodes are dedicated to a single parallel job/application but MPI could also be used with nodes that are concurrently shared by multiple users. In this case, nodes could become overloaded with work from other users. Even a few overloaded nodes can result in application slowdown. Thus, it is desirable to relocate affected processes in a running application to lightly loaded nodes by partial checkpointing and migrating of those processes. In some MPI applications, groups of processes communicate frequently with one another. Such groups must be near one another to ensure communication efficiency. Thus, if any member of a group is to be checkpointed and migrated, all should be. It must therefore be possible to identify such groups. I have built a prototype, using LAM/MPI, that supports partial checkpoint, migration and restart of MPI processes. To identify process groups for checkpoint and migration, I adapted TEIRESIAS (an algorithm for pattern discovery from bioinformatics) to identify frequent, recurring patterns of communication using data gathered by LAM/MPI. I then created predictors that use the discovered patterns to predict groups of communicating processes that should be checkpointed and migrated together. I have assessed the effectiveness of my technique using synthetic and real communication data (for a small set of representative applications) to show that my predictors can accurately predict process groups for those applications. Additionally, I have created a simple simulation system to allow me to explore scenarios related to network characteristics and overload conditions under which my system might provide useful speedup. Not all MPI applications will benefit from my approach (e.g. those with unpredictable communication patterns or large groups of frequently communicating processes). However, my experimental and simulation results suggest that my technique should be effective for a number of common application types, network characteristics and overload conditions. Using partial checkpoint and migration should therefore allow many long running applications to finish faster than if a subset of their processes was left running on overloaded nodes." @default.
- W2476272165 created "2016-08-23" @default.
- W2476272165 creator A5016231661 @default.
- W2476272165 creator A5048218770 @default.
- W2476272165 date "2011-01-01" @default.
- W2476272165 modified "2023-09-27" @default.
- W2476272165 title "Performance oriented partial checkpoint and migration of lam/mpi applications" @default.
- W2476272165 hasPublicationYear "2011" @default.
- W2476272165 type Work @default.
- W2476272165 sameAs 2476272165 @default.
- W2476272165 citedByCount "0" @default.
- W2476272165 crossrefType "journal-article" @default.
- W2476272165 hasAuthorship W2476272165A5016231661 @default.
- W2476272165 hasAuthorship W2476272165A5048218770 @default.
- W2476272165 hasConcept C111919701 @default.
- W2476272165 hasConcept C120314980 @default.
- W2476272165 hasConcept C164866538 @default.
- W2476272165 hasConcept C166782233 @default.
- W2476272165 hasConcept C173608175 @default.
- W2476272165 hasConcept C177264268 @default.
- W2476272165 hasConcept C199360897 @default.
- W2476272165 hasConcept C31258907 @default.
- W2476272165 hasConcept C41008148 @default.
- W2476272165 hasConcept C7042729 @default.
- W2476272165 hasConcept C83283714 @default.
- W2476272165 hasConcept C854659 @default.
- W2476272165 hasConcept C98045186 @default.
- W2476272165 hasConceptScore W2476272165C111919701 @default.
- W2476272165 hasConceptScore W2476272165C120314980 @default.
- W2476272165 hasConceptScore W2476272165C164866538 @default.
- W2476272165 hasConceptScore W2476272165C166782233 @default.
- W2476272165 hasConceptScore W2476272165C173608175 @default.
- W2476272165 hasConceptScore W2476272165C177264268 @default.
- W2476272165 hasConceptScore W2476272165C199360897 @default.
- W2476272165 hasConceptScore W2476272165C31258907 @default.
- W2476272165 hasConceptScore W2476272165C41008148 @default.
- W2476272165 hasConceptScore W2476272165C7042729 @default.
- W2476272165 hasConceptScore W2476272165C83283714 @default.
- W2476272165 hasConceptScore W2476272165C854659 @default.
- W2476272165 hasConceptScore W2476272165C98045186 @default.
- W2476272165 hasLocation W24762721651 @default.
- W2476272165 hasOpenAccess W2476272165 @default.
- W2476272165 hasPrimaryLocation W24762721651 @default.
- W2476272165 hasRelatedWork W1545979834 @default.
- W2476272165 hasRelatedWork W1566356205 @default.
- W2476272165 hasRelatedWork W1800026166 @default.
- W2476272165 hasRelatedWork W2104639126 @default.
- W2476272165 hasRelatedWork W2107303919 @default.
- W2476272165 hasRelatedWork W2138671691 @default.
- W2476272165 hasRelatedWork W2166456798 @default.
- W2476272165 hasRelatedWork W2276392719 @default.
- W2476272165 hasRelatedWork W2339010456 @default.
- W2476272165 hasRelatedWork W2512443821 @default.
- W2476272165 hasRelatedWork W2529077345 @default.
- W2476272165 hasRelatedWork W2533524164 @default.
- W2476272165 hasRelatedWork W2606552463 @default.
- W2476272165 hasRelatedWork W2613982821 @default.
- W2476272165 hasRelatedWork W2910983495 @default.
- W2476272165 hasRelatedWork W2939262575 @default.
- W2476272165 hasRelatedWork W37230481 @default.
- W2476272165 hasRelatedWork W43776143 @default.
- W2476272165 hasRelatedWork W2244427474 @default.
- W2476272165 hasRelatedWork W2751751794 @default.
- W2476272165 isParatext "false" @default.
- W2476272165 isRetracted "false" @default.
- W2476272165 magId "2476272165" @default.
- W2476272165 workType "article" @default.