Energy issues of GPU computing clusters
Stéphane Vialle, SUPELEC - UMI GT-CNRS 2958 & AlGorille INRIA Project Team. EJC, 19-20/11/2012, Lyon, France
Lokman Abbas-Turki, Stéphane Vialle, Bernard Lapeyre (2008)
[Figure: the input data file is read on PC0 and distributed across the nodes PC0 … PCP-1; each node's share of the work is then split over its GPU cores]

1 - Input data reading on P0 (coarse grain)
2 - Input data broadcast from P0 (coarse grain)
3 - Parallel and independent RNG initialization (coarse and fine grain)
4 - Parallel and independent Monte-Carlo computations (coarse and fine grain)
5 - Parallel and independent partial-result computation (coarse and fine grain)
6 - Partial-result reduction on P0 and final price computation (coarse grain)
7 - Print results and performances (coarse grain)
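The seven steps above can be sketched as a coarse-grain Monte Carlo driver. This is only a minimal serial illustration of the structure, not the authors' implementation: the process count, the GBM model, the arithmetic-average Asian payoff, and all market parameters are assumptions, and one independent NumPy RNG stream per node stands in for the parallel RNG initialization of step 3.

```python
import numpy as np

def asian_mc_partial(rng, n_paths, n_steps, s0, k, r, sigma, t):
    """One node's share of the Monte-Carlo work (steps 4-5)."""
    dt = t / n_steps
    # Simulate GBM paths and average each one in time (arithmetic Asian payoff).
    z = rng.standard_normal((n_paths, n_steps))
    log_paths = np.cumsum((r - 0.5 * sigma**2) * dt
                          + sigma * np.sqrt(dt) * z, axis=1)
    means = np.mean(s0 * np.exp(log_paths), axis=1)
    payoffs = np.exp(-r * t) * np.maximum(means - k, 0.0)
    return payoffs.sum(), n_paths

def price_asian(p_nodes=4, paths_per_node=50_000):
    # Steps 1-2: "read" and broadcast the input data (plain values here).
    s0, k, r, sigma, t, n_steps = 100.0, 100.0, 0.05, 0.2, 1.0, 64
    # Step 3: one independent RNG stream per node.
    streams = np.random.SeedSequence(42).spawn(p_nodes)
    # Steps 4-5: each node computes its partial sum
    # (a serial loop stands in for the cluster's nodes).
    partials = [asian_mc_partial(np.random.default_rng(s), paths_per_node,
                                 n_steps, s0, k, r, sigma, t)
                for s in streams]
    # Step 6: reduction on P0 and final price computation.
    total, n = map(sum, zip(*partials))
    return total / n

price = price_asian()
print(round(price, 2))  # step 7: print the final price
```

On a real cluster, step 2 would be an MPI broadcast and step 6 an MPI reduction, with step 4 offloaded to each node's GPU.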
[Figure: Asian pricer on clusters of GPUs and CPUs: total execution time (s) vs number of nodes, for the 512x512, 512x1024 and 1024x1024 problem sizes, run on 1 CPU core, 1 full CPU node, and 1 GPU node]

Testbeds:
- 256 INTEL dual-core nodes, 1 CISCO 256-port Gigabit-Ethernet switch
- 16 INTEL dual-core nodes with 1 GPU (GeForce 8800 GT) per node, 1 DELL 24-port Gigabit-Ethernet switch

[Figure: Asian pricer on clusters of GPUs and CPUs: consumed energy (W.h) vs number of nodes, for the same problem sizes on the CPU and GPU clusters]
Sylvain Contassot-Vivier, Stéphane Vialle, Thomas Jost, Wilfried Kirschenmann (2009)
Computation scheme: … internode CPU communications, local CPU-to-GPU data transfers, local GPU computations, local GPU-to-CPU partial-result transfers, local CPU computations (not adapted to GPU processing), internode CPU communications, local CPU-to-GPU data transfers, …
Each datum has a global index, a node CPU-buffer index, a node GPU-buffer index, a fast-shared-memory index in a sub-part of the GPU…
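This multi-level indexing can be made concrete with a small translation helper. The sketch below is hypothetical: it assumes a plain block distribution of the data across nodes and a one-to-one copy from a node's CPU buffer into its GPU buffer, and the function and parameter names are illustrative, not the authors'.

```python
def locate(global_idx, n_nodes, node_buf_len, tile_len):
    """Map one datum's global index to its per-level indices
    (block distribution across nodes assumed)."""
    node = global_idx // node_buf_len      # which PC holds the datum
    assert node < n_nodes, "global index out of range"
    cpu_idx = global_idx % node_buf_len    # index in that node's CPU buffer
    gpu_idx = cpu_idx                      # GPU buffer mirrors the CPU buffer here
    tile = gpu_idx // tile_len             # which sub-part of the GPU
    shared_idx = gpu_idx % tile_len        # index in the fast shared memory
    return node, cpu_idx, gpu_idx, tile, shared_idx

print(locate(1234, n_nodes=4, node_buf_len=1024, tile_len=256))
# → (1, 210, 210, 0, 210)
```

Keeping these translations consistent across every transfer in the pipeline is a large part of the programming effort on a GPU cluster.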
[Figure: Pricer - parallel MC: execution time (s) vs number of nodes]
[Figure: Pricer - parallel MC: energy (W.h) vs number of nodes]
[Figure: PDE solver - synchronous: execution time (s) vs number of nodes]
[Figure: PDE solver - synchronous: energy (W.h) vs number of nodes]
[Figure: Jacobi relaxation: execution time (s) vs number of nodes]
[Figure: Jacobi relaxation: energy (W.h) vs number of nodes]
Configurations compared: monocore-CPU cluster, multicore-CPU cluster, manycore-GPU cluster
[Figure: Jacobi, gains vs multicore CPU cluster: SU and ED of the n-core GPU cluster vs the n-core CPU cluster, by number of nodes]
[Figure: 2D relaxation, gains vs 1 monocore node: SU and EG of the n-core GPU vs 1 monocore CPU node, by number of nodes]
[Figure: Jacobi relaxation, gains vs 1 multicore node: SU and EG of the n-core GPU vs 1 multicore CPU node, by number of nodes]
If you have a GPU cluster, you also have a CPU cluster! And if you succeed in programming a GPU cluster, you can program a CPU cluster! So, when possible, compare a GPU cluster to a CPU cluster (not to one CPU core…): the comparison will look really different.
Speedup Energy gain
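Both metrics on these slides are ratios against a reference run: speedup SU = T_ref / T and energy gain EG = E_ref / E. The sketch below illustrates them with invented numbers (a slower, lower-power CPU node vs a faster, higher-power GPU node); the wattages are assumptions, not measurements from the talk.

```python
def speedup(t_ref, t):
    """SU: how many times faster than the reference run."""
    return t_ref / t

def energy_gain(e_ref, e):
    """EG: how many times less energy than the reference run."""
    return e_ref / e

# Invented numbers: the GPU node is 10x faster but draws more power,
# so its energy gain is smaller than its speedup.
t_cpu, p_cpu = 1000.0, 200.0    # s, W  (assumed multicore CPU node)
t_gpu, p_gpu = 100.0, 350.0     # s, W  (assumed GPU node)
e_cpu = t_cpu * p_cpu / 3600.0  # energy in W.h
e_gpu = t_gpu * p_gpu / 3600.0
print(speedup(t_cpu, t_gpu))                # → 10.0
print(round(energy_gain(e_cpu, e_gpu), 2))  # → 5.71
```

With these numbers the GPU node wins on both metrics, but by a smaller factor on energy, since its power draw is higher; whether EG tracks SU is exactly what the next three plots examine.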
[Figure: GPU cluster vs multicore CPU cluster, Jacobi relaxation: gains vs number of nodes] Oops…
[Figure: GPU cluster vs multicore CPU cluster, pricer (parallel MC): gains vs number of nodes] OK
[Figure: GPU cluster vs multicore CPU cluster, PDE solver (synchronous): gains vs number of nodes] Hum…
Sylvain Contassot-Vivier, Stéphane Vialle (2009-2010)
[Figures: execution time (s), energy consumption (W.h), and energy overhead factor vs number of fast nodes]
[Figures: speedup and energy gain vs number of fast nodes]
Lokman Abbas-Turki, Stéphane Vialle (2010-2012)
[Figure: GPU vs CPU mono-node performances: speedup and energy gain for the "5Stk-16Ktt", "5Stk-32Ktt" and "5Stk-64Ktt" cases]
The CPU version is parallel, but probably not…
[Figure: Pricer AmerMal - 5Stk: computation time (s) vs number of nodes, for the 64Ktt, 128Ktt and 256Ktt cases]
[Figure: Pricer AmerMal - 5Stk: energy consumption (W.h) vs number of nodes, for the 64Ktt, 128Ktt and 256Ktt cases]