Matches in SemOpenAlex for { <https://semopenalex.org/work/W4308760184> ?p ?o ?g. }
Showing items 1 to 86 of 86, with 100 items per page.
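The header above shows the triple pattern that produced this listing. Below is a minimal sketch of how the same query could be run programmatically; the public SPARQL endpoint URL (https://semopenalex.org/sparql) and the GRAPH form of the pattern are assumptions not stated in the listing, so verify them before use.

```python
import requests

# Assumed endpoint for SemOpenAlex; not stated in the listing above.
ENDPOINT = "https://semopenalex.org/sparql"

# Quad pattern from the header, written with an explicit GRAPH variable.
QUERY = """
SELECT ?p ?o ?g WHERE {
  GRAPH ?g { <https://semopenalex.org/work/W4308760184> ?p ?o . }
}
LIMIT 100
"""

def fetch_triples():
    # Standard SPARQL 1.1 protocol: POST the query, request JSON results.
    resp = requests.post(
        ENDPOINT,
        data={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

if __name__ == "__main__":
    for row in fetch_triples():
        print(row["p"]["value"], row["o"]["value"])
```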
- W4308760184 abstract "We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e. multiple query heads share single key/value head) enables scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model." @default.
- W4308760184 created "2022-11-15" @default.
- W4308760184 creator A5011889606 @default.
- W4308760184 creator A5018290574 @default.
- W4308760184 creator A5033596864 @default.
- W4308760184 creator A5052263308 @default.
- W4308760184 creator A5055969617 @default.
- W4308760184 creator A5057457287 @default.
- W4308760184 creator A5069544528 @default.
- W4308760184 creator A5077434272 @default.
- W4308760184 creator A5086404363 @default.
- W4308760184 creator A5090477236 @default.
- W4308760184 date "2022-11-09" @default.
- W4308760184 modified "2023-09-27" @default.
- W4308760184 title "Efficiently Scaling Transformer Inference" @default.
- W4308760184 doi "https://doi.org/10.48550/arxiv.2211.05102" @default.
- W4308760184 hasPublicationYear "2022" @default.
- W4308760184 type Work @default.
- W4308760184 citedByCount "0" @default.
- W4308760184 crossrefType "posted-content" @default.
- W4308760184 hasAuthorship W4308760184A5011889606 @default.
- W4308760184 hasAuthorship W4308760184A5018290574 @default.
- W4308760184 hasAuthorship W4308760184A5033596864 @default.
- W4308760184 hasAuthorship W4308760184A5052263308 @default.
- W4308760184 hasAuthorship W4308760184A5055969617 @default.
- W4308760184 hasAuthorship W4308760184A5057457287 @default.
- W4308760184 hasAuthorship W4308760184A5069544528 @default.
- W4308760184 hasAuthorship W4308760184A5077434272 @default.
- W4308760184 hasAuthorship W4308760184A5086404363 @default.
- W4308760184 hasAuthorship W4308760184A5090477236 @default.
- W4308760184 hasBestOaLocation W43087601841 @default.
- W4308760184 hasConcept C11413529 @default.
- W4308760184 hasConcept C119599485 @default.
- W4308760184 hasConcept C127413603 @default.
- W4308760184 hasConcept C154945302 @default.
- W4308760184 hasConcept C165801399 @default.
- W4308760184 hasConcept C166957645 @default.
- W4308760184 hasConcept C2524010 @default.
- W4308760184 hasConcept C2776214188 @default.
- W4308760184 hasConcept C28855332 @default.
- W4308760184 hasConcept C33923547 @default.
- W4308760184 hasConcept C38652104 @default.
- W4308760184 hasConcept C41008148 @default.
- W4308760184 hasConcept C48145219 @default.
- W4308760184 hasConcept C66322947 @default.
- W4308760184 hasConcept C76155785 @default.
- W4308760184 hasConcept C79581498 @default.
- W4308760184 hasConcept C82876162 @default.
- W4308760184 hasConcept C95457728 @default.
- W4308760184 hasConcept C99844830 @default.
- W4308760184 hasConceptScore W4308760184C11413529 @default.
- W4308760184 hasConceptScore W4308760184C119599485 @default.
- W4308760184 hasConceptScore W4308760184C127413603 @default.
- W4308760184 hasConceptScore W4308760184C154945302 @default.
- W4308760184 hasConceptScore W4308760184C165801399 @default.
- W4308760184 hasConceptScore W4308760184C166957645 @default.
- W4308760184 hasConceptScore W4308760184C2524010 @default.
- W4308760184 hasConceptScore W4308760184C2776214188 @default.
- W4308760184 hasConceptScore W4308760184C28855332 @default.
- W4308760184 hasConceptScore W4308760184C33923547 @default.
- W4308760184 hasConceptScore W4308760184C38652104 @default.
- W4308760184 hasConceptScore W4308760184C41008148 @default.
- W4308760184 hasConceptScore W4308760184C48145219 @default.
- W4308760184 hasConceptScore W4308760184C66322947 @default.
- W4308760184 hasConceptScore W4308760184C76155785 @default.
- W4308760184 hasConceptScore W4308760184C79581498 @default.
- W4308760184 hasConceptScore W4308760184C82876162 @default.
- W4308760184 hasConceptScore W4308760184C95457728 @default.
- W4308760184 hasConceptScore W4308760184C99844830 @default.
- W4308760184 hasLocation W43087601841 @default.
- W4308760184 hasLocation W43087601842 @default.
- W4308760184 hasOpenAccess W4308760184 @default.
- W4308760184 hasPrimaryLocation W43087601841 @default.
- W4308760184 hasRelatedWork W1985412924 @default.
- W4308760184 hasRelatedWork W2375389409 @default.
- W4308760184 hasRelatedWork W2488051804 @default.
- W4308760184 hasRelatedWork W2625315266 @default.
- W4308760184 hasRelatedWork W2948197522 @default.
- W4308760184 hasRelatedWork W3008625068 @default.
- W4308760184 hasRelatedWork W3039805635 @default.
- W4308760184 hasRelatedWork W4213041209 @default.
- W4308760184 hasRelatedWork W4294982680 @default.
- W4308760184 hasRelatedWork W2779562428 @default.
- W4308760184 isParatext "false" @default.
- W4308760184 isRetracted "false" @default.
- W4308760184 workType "article" @default.
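The abstract above credits multiquery attention (multiple query heads sharing a single key/value head) with lowering memory requirements enough to scale to much longer context lengths. The following is a minimal NumPy sketch of that attention variant only, not the paper's partitioned TPU implementation; the shapes and names are illustrative assumptions.

```python
import numpy as np

def multiquery_attention(q, k, v):
    """Minimal multiquery attention: h query heads share one key/value head.

    q: (h, n, d)  queries, one set per head
    k: (m, d)     keys, a single shared head
    v: (m, d)     values, a single shared head
    Returns an (h, n, d) array of attended values.
    """
    d = q.shape[-1]
    # Scaled dot-product scores for every query head against the shared keys.
    scores = np.einsum("hnd,md->hnm", q, k) / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # The key/value cache holds (m, d) per layer instead of (h, m, d),
    # which is the memory saving the abstract ties to longer contexts.
    return np.einsum("hnm,md->hnd", weights, v)

# Toy usage: 8 query heads, 4 query tokens, 16 cached tokens, head dim 32.
q = np.random.randn(8, 4, 32)
k = np.random.randn(16, 32)
v = np.random.randn(16, 32)
out = multiquery_attention(q, k, v)
assert out.shape == (8, 4, 32)
```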