Tackling Performance Bottlenecks in the Diversifying CUDA HPC Ecosystem: a Molecular Dynamics Perspective
Szilárd Páll
KTH Royal Institute of Technology
GTC 2015

Diversifying hardware & complex parallelism
– parallel on multiple levels
– heterogeneous
– power constrained
[Figure: the diversifying hardware landscape]
– increasing CPU/GPU core count; NUMA on die
– widening SIMD/SIMT units
– “skinny” workstations to fat compute nodes: NUMA, ...
– mini-clusters, petaflop machines & on-demand computing (compute cloud)
Reproduced under CC BY-SA from: http://commons.wikimedia.org/wiki/File:MM_PEF.png
– N particles, masses, potential V
– forces → accelerations → velocities
– velocities → positions
Newton's equations
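For reference (standard MD background rather than slide content), the equations integrated at each step, written here in a leapfrog form such as GROMACS uses by default:

```latex
% Newton's equations for particle i with mass m_i:
m_i \frac{d^2 \mathbf{r}_i}{dt^2} = \mathbf{F}_i = -\nabla_{\mathbf{r}_i} V(\mathbf{r}_1, \dots, \mathbf{r}_N)

% Leapfrog update with time step \Delta t:
\mathbf{v}_i\!\left(t + \tfrac{\Delta t}{2}\right) = \mathbf{v}_i\!\left(t - \tfrac{\Delta t}{2}\right) + \frac{\mathbf{F}_i(t)}{m_i}\,\Delta t, \qquad
\mathbf{r}_i(t + \Delta t) = \mathbf{r}_i(t) + \mathbf{v}_i\!\left(t + \tfrac{\Delta t}{2}\right)\Delta t
```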
Bonded interactions:
– few but localized
– imbalance challenge (threading & domain decomposition)
Non-bonded interactions: computed over all atom pairs!
– too expensive → limit the interaction range (cut-off)
– van der Waals: can use a cut-off
– electrostatics: a cut-off alone is not good enough → PME
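For context (standard force-field background, not taken from the slides), the two non-bonded pair terms behind this distinction are:

```latex
V_{ij}(r_{ij}) =
\underbrace{4\varepsilon_{ij}\!\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]}_{\text{Lennard-Jones: decays fast, a cut-off works}}
\;+\;
\underbrace{\frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}}}_{\text{Coulomb: } \sim 1/r\text{, needs PME beyond the cut-off}}
```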
https://gerrit.gromacs.org
– 10k's of users in academia & industry
– 100k's through Folding@home
[Figure: eighth-shell domain decomposition; triclinic unit cell with load balancing and staggered cell boundaries; arbitrary unit cells, parallel constraints, virtual interaction sites]
– C++98 (subset) – CMake
– LOC: ~2 mil. ½ of which is SIMD!
[Figure: force-computation breakdown]
– Non-bonded: pair-search distance check; LJ + Coulomb tabulated (F, F+E); Coulomb tabulated (F, F+E); 1-4 nonbonded interactions
– PME: calc weights, spread Q B-spline, gather F B-spline, 3D-FFT
– Bonded & other: angles, propers, SETTLE
[Figure: anatomy of an MD step: non-bonded F, bonded F, PME, integration, constraints; a pair-search step runs every 10-50 iterations; one MD iteration = one step, taking ~ milliseconds or less]
since GROMACS v4.6
– offload-based: 2-4x speedup
– multi-GPU “for free”
– wide feature support
but:
– added latencies, overheads
– load balancing
[Figure: CPU-GPU offload timeline for a “regular” MD step]
– CPU: pair search/domain decomposition every 10-50 iterations; bonded F, PME, integration, constraints; launches the GPU work, then waits for the GPU
– GPU: H2D x,q → non-bonded F (& pair-list pruning) → D2H F → clear F; idle for the rest of the step
– peak step times: 100s of microseconds!
– average CPU-GPU overlap: 60-80% per step
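The offload pattern in this figure can be sketched with a single asynchronous CUDA stream per step. The following is a minimal, hedged illustration; the kernel, its trivial body, and the buffer names are placeholders, not GROMACS code:

```cuda
#include <cuda_runtime.h>

// Illustrative stand-in for the short-range non-bonded force kernel.
__global__ void nonbonded_force_kernel(const float4 *xq, float3 *f, int natoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < natoms)
    {
        // ... pair-list traversal and force accumulation would go here ...
        f[i] = make_float3(0.f, 0.f, 0.f);
    }
}

// One offloaded MD step. Assumes d_xq/d_f were allocated with cudaMalloc and
// h_xq/h_f are pinned (cudaHostAlloc) so the copies can overlap with CPU work.
void do_md_step(cudaStream_t s, const float4 *h_xq, float4 *d_xq,
                float3 *h_f, float3 *d_f, int natoms)
{
    // H2D: coordinates and charges for this step
    cudaMemcpyAsync(d_xq, h_xq, natoms * sizeof(*d_xq), cudaMemcpyHostToDevice, s);

    // Launch the non-bonded force kernel
    int block = 128, grid = (natoms + block - 1) / block;
    nonbonded_force_kernel<<<grid, block, 0, s>>>(d_xq, d_f, natoms);

    // D2H: forces back, then clear the device force buffer for the next step
    cudaMemcpyAsync(h_f, d_f, natoms * sizeof(*d_f), cudaMemcpyDeviceToHost, s);
    cudaMemsetAsync(d_f, 0, natoms * sizeof(*d_f), s);

    // ... the CPU computes bonded forces and PME here, overlapping the GPU ...

    // Wait for the GPU before combining forces, integrating and constraining
    cudaStreamSynchronize(s);
}
```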
since GROMACS v4.6
– offload-based: 3-4x speedup
– multi-GPU “for free”
– wide feature support
but:
– added latency, re-casting work for GPUs
– load balancing: intra-GPU, intra-node, ...
[Figure: CPU-GPU timeline with domain decomposition, using separate local and non-local CUDA streams]
– CPU (OpenMP threads): local & non-local pair search every 10-50 iterations; bonded F, PME, integration, constraints; MPI receive of non-local x and MPI send of non-local F; waits for the non-local and then the local GPU forces
– GPU, local stream: H2D local pair-list, H2D local x,q, local non-bonded F & list pruning, D2H local F, clear F
– GPU, non-local stream: H2D non-local pair-list, H2D non-local x,q, non-local non-bonded F & list pruning, D2H non-local F
– the GPU is idle during the pair-search/domain-decomposition step; a “regular” MD step follows
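A minimal sketch of the two-stream arrangement (again an illustrative pseudostructure with made-up names, not the GROMACS implementation; it assumes pinned host buffers and non-empty local and non-local atom ranges):

```cuda
#include <cuda_runtime.h>

// Illustrative stand-in for the non-bonded kernel on a range of atoms.
__global__ void nb_kernel(const float4 *xq, float3 *f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        f[i] = make_float3(0.f, 0.f, 0.f);  // real kernel: cluster-pair force loop
    }
}

// One DD step with separate local and non-local streams. Local atoms occupy
// [0, n_local), halo (non-local) atoms [n_local, n_total).
void step_with_dd(cudaStream_t loc, cudaStream_t nonloc,
                  const float4 *h_xq, float4 *d_xq, float3 *h_f, float3 *d_f,
                  int n_local, int n_total)
{
    int n_nl = n_total - n_local;

    // Local coordinates are ready first: copy and launch on the local stream.
    cudaMemcpyAsync(d_xq, h_xq, n_local * sizeof(*d_xq), cudaMemcpyHostToDevice, loc);
    nb_kernel<<<(n_local + 127) / 128, 128, 0, loc>>>(d_xq, d_f, n_local);

    // ... meanwhile the CPU receives the non-local (halo) x via MPI ...

    // Non-local work goes to its own stream so it can overlap with local work.
    cudaMemcpyAsync(d_xq + n_local, h_xq + n_local, n_nl * sizeof(*d_xq),
                    cudaMemcpyHostToDevice, nonloc);
    nb_kernel<<<(n_nl + 127) / 128, 128, 0, nonloc>>>(d_xq + n_local, d_f + n_local, n_nl);

    // Non-local forces come back first so the MPI send of halo forces can start
    // as early as possible; local forces follow on the local stream.
    cudaMemcpyAsync(h_f + n_local, d_f + n_local, n_nl * sizeof(*d_f),
                    cudaMemcpyDeviceToHost, nonloc);
    cudaMemcpyAsync(h_f, d_f, n_local * sizeof(*d_f), cudaMemcpyDeviceToHost, loc);

    cudaStreamSynchronize(nonloc);  // ... then MPI send of non-local F ...
    cudaStreamSynchronize(loc);     // ... then reduce forces, integrate, constrain ...
}
```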
– lends itself well to efficient fine-grained parallelization
– adaptable to the characteristics of the architecture
[Figure: pair-interaction setups: classical 1x1 neighborlist on 4-way SIMD, the 4x4 cluster setup on 4-way SIMD, and the 4x4 setup on SIMT]
– data reuse – arithmetic
– cache-efficiency
– Traditional 1x1 algorithm: cache pressure, poor data reuse, in-register shuffle bound
– Cluster algorithm for SSE4, VMX, 128-bit AVX: cache friendly, 4-way j-reuse
– Cluster algorithm for fine-grained hardware threading: SIMT, etc.
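To make the 4x4-on-SIMT idea concrete, here is a deliberately simplified CUDA sketch in which each thread owns one atom of a 4-atom i-cluster and the block iterates over that cluster's j-list. It only illustrates the scheme; the real GROMACS kernels use larger clusters, super-clusters, exclusion masks, and many further optimizations:

```cuda
#include <cuda_runtime.h>

#define CLUSTER_SIZE 4  // atoms per cluster in this simplified illustration

// Launch with <<<numIClusters, CLUSTER_SIZE>>>: one block per i-cluster, one
// thread per i-atom. cj_list holds j-cluster indices, flattened; cj_start and
// cj_end delimit each i-cluster's sub-list.
__global__ void cluster_pair_kernel(const float4 *xq, float3 *f,
                                    const int *cj_list, const int *cj_start,
                                    const int *cj_end, float rcut2)
{
    int ci = blockIdx.x;                       // this block's i-cluster
    int ai = ci * CLUSTER_SIZE + threadIdx.x;  // this thread's i-atom
    float4 xqi = xq[ai];
    float3 fi  = make_float3(0.f, 0.f, 0.f);

    __shared__ float4 xqj[CLUSTER_SIZE];

    for (int idx = cj_start[ci]; idx < cj_end[ci]; idx++)
    {
        // Load the whole j-cluster once: 4-way j-data reuse across the i-atoms.
        xqj[threadIdx.x] = xq[cj_list[idx] * CLUSTER_SIZE + threadIdx.x];
        __syncthreads();

        // All-vs-all within the 4x4 cluster pair; pairs beyond the cut-off are
        // the extra arithmetic traded for regular, cache-friendly access.
        for (int j = 0; j < CLUSTER_SIZE; j++)
        {
            float dx = xqi.x - xqj[j].x, dy = xqi.y - xqj[j].y, dz = xqi.z - xqj[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < rcut2 && r2 > 0.f)
            {
                float rinv = rsqrtf(r2);
                float s    = xqi.w * xqj[j].w * rinv * rinv * rinv;  // toy Coulomb
                fi.x += s * dx;
                fi.y += s * dy;
                fi.z += s * dz;
            }
        }
        __syncthreads();
    }
    f[ai] = fi;  // each i-atom belongs to exactly one i-cluster: no atomics needed
}
```

Loading each j-cluster once into shared memory is what provides the j-reuse and the regular memory access the slide refers to.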
– create enough independent work units (= blocks) per SM[X|M]
– sort them
– 2-3x on “narrower” GPUs
– 3-5x on “wider” GPUs
– lowers j-particle data reuse – atomic clashes
[Figure: workload per SMX (Tesla K20c, 1500 atoms): per-SMX execution time (kcycles) and pair-list sizes for the raw pair list vs. the reshaped list]
– Raw lists: too few blocks → imbalanced
– Regularized lists: balanced SMX execution
– Regularized: 4x faster execution
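One way to picture the list regularization is splitting each i-cluster's long j-list into bounded chunks so that the number of thread blocks far exceeds the SM count. A hedged host-side sketch (the data structures are invented for illustration and are not GROMACS'):

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustration only: split each i-cluster's j-list into chunks of at most
// max_cj_per_block j-clusters, so the kernel gets many similarly sized blocks.
struct PairListEntry
{
    int ci;               // i-cluster index
    std::vector<int> cj;  // j-cluster indices interacting with ci
};

std::vector<PairListEntry> regularize(const std::vector<PairListEntry> &raw,
                                      std::size_t max_cj_per_block)
{
    std::vector<PairListEntry> out;
    for (const PairListEntry &e : raw)
    {
        for (std::size_t start = 0; start < e.cj.size(); start += max_cj_per_block)
        {
            std::size_t end = std::min(e.cj.size(), start + max_cj_per_block);
            out.push_back({e.ci, std::vector<int>(e.cj.begin() + start,
                                                  e.cj.begin() + end)});
        }
    }
    // Each entry becomes one thread block: several blocks may now share an
    // i-cluster, so force accumulation needs atomics and j-data reuse drops,
    // which are the trade-offs noted above.
    return out;
}
```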
[Figure: Tesla K20c force-kernel tuning: kernel time (0.05-0.5 ms) while scanning list splits from 0 → 1000, for systems of 960, 1.5k, 3k, 6k, and 12k atoms]
– better inter-SM load balancing
– more consistent list lengths
– concurrent atomic operations

[Figure: CUDA non-bonded force kernel performance history: step time per 1000 atoms (ms) vs. system size (0.96k-768k atoms) on a Tesla C2070; PME, rc = 1.0 nm, nstlist = 20; curves: initial target, first alpha, second alpha, pair-list tweaks, Kepler back-port, improved sorting, CUDA 5.5 + buffer improvement, GMX 5.0, reduction tweak; a second panel additionally shows the GTX TITAN X]
[Figure: particle-cluster processing (i- and j-loops) on GF/GK1xx/GM vs. on GK210; 4x4 setup on SIMT]
GPU, #threads x #blocks   kernel time (ms)
K80   64x16                    3.95
K80  128x16                    2.90
K80  128x15                    2.91
K80  128x14                    3.08
K80  256x8                     3.06
K40   64x16                    3.41
K40  128x8                     3.57
K40  256x4                     5.50
– First two generations of Kepler: moderate boost headroom (Tesla K20 7.5%, K40 17.5%)
– Tesla K80: aggressive boost, 562 MHz → 875 MHz (+55.7%)
→ a load-balance concern
– adjust clocks at run-time or warn the user if application clocks are set below the maximum
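A hedged sketch of such a run-time check using NVML (it mirrors the idea rather than GROMACS' actual implementation; error handling is minimal):

```cuda
#include <cstdio>
#include <nvml.h>   // link with -lnvidia-ml

// Warn if the SM application clock is set below the maximum supported clock;
// with sufficient permissions one could instead raise it with
// nvmlDeviceSetApplicationsClocks(). Illustrative sketch only.
void check_application_clocks(unsigned int gpu_index)
{
    if (nvmlInit() != NVML_SUCCESS)
    {
        return;
    }

    nvmlDevice_t dev;
    unsigned int appClock = 0, maxClock = 0;
    if (nvmlDeviceGetHandleByIndex(gpu_index, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetApplicationsClock(dev, NVML_CLOCK_SM, &appClock) == NVML_SUCCESS &&
        nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_SM, &maxClock) == NVML_SUCCESS &&
        appClock < maxClock)
    {
        fprintf(stderr,
                "NOTE: GPU %u application clock is %u MHz, below the maximum of "
                "%u MHz; consider raising it (e.g. with nvidia-smi -ac).\n",
                gpu_index, appClock, maxClock);
    }
    nvmlShutdown();
}
```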
+------------------------------------------------------+
| NVIDIA-SMI 346.47     Driver Version: 346.47         |
|-------------------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage |
|===============================+======================+
|   0  GeForce GTX TITAN    On  | 0000:02:00.0     Off |
| 58%   80C    P0   191W / 250W |    107MiB /  6143MiB |
+-------------------------------+----------------------+
|   1  Quadro M6000         On  | 0000:03:00.0     Off |
| 54%   83C    P0   199W / 250W |    533MiB / 12287MiB |
+-------------------------------+----------------------+
power consumption asymmetry → performance asymmetry
– fan speed capped by default to <60% – throttle-prone: temperature limit (~80C)
[Figure: K80 power (W), temperature (°C), and clock (MHz/10) for chip 0 and chip 1 over ~500 samples taken every 2 s]
[Figure: temperature (°C) and clock frequency (MHz) over time (samples every 2 s) for a GTX TITAN, and for a Quadro M6000 at application clocks of 1151 MHz, 987 MHz, and 1113 MHz with boost disabled]
The Quadro M6000 throttles regardless of application clocks (AC) if auto-boost is on.
The GTX TITAN can throttle by 10-20% even in a well-cooled chassis (non-OC cards).
K80 Chip1: hotter, slower, more power-hungry.
xorg.conf:

Section "ServerLayout"
    Identifier "dual"
    Screen 0 "Screen0"
    Screen 1 "Screen1" RightOf "Screen0"
EndSection

Section "Device"
    Identifier "nvidia0"
    Driver     "nvidia"
    VendorName "NVIDIA"
    BoardName  "GeForce GTX TITAN"
    Option     "UseDisplayDevice" "none"
    Option     "Coolbits" "4"
    BusID      "PCI:2:0:0"
EndSection

Section "Device"
    [...]
EndSection

Section "Screen"
    Identifier "Screen0"
    Device     "nvidia0"
EndSection

Section "Screen"
    [...]
EndSection
Start the X server:

export DISPLAY=:0
/usr/bin/X $DISPLAY -nolisten tcp vt7 -novtswitch &
nvidia-settings -a [gpu:0]/GPUFanControlState=1 -a [fan:0]/GPUCurrentFanSpeed=80
nvidia-settings -a [gpu:1]/GPUFanControlState=1 -a [fan:1]/GPUCurrentFanSpeed=80

+------------------------------------------------------+
| NVIDIA-SMI 346.47     Driver Version: 346.47         |
|-------------------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage |
|===============================+======================+
|   0  GeForce GTX TITAN    On  | 0000:02:00.0     Off |
| 80%   69C    P8   205W / 250W |    549MiB /  6143MiB |
+-------------------------------+----------------------+
|   1  Quadro M6000         On  | 0000:03:00.0     Off |
| 80%   66C    P0   203W / 250W |    533MiB / 12287MiB |
+-------------------------------+----------------------+
Thanks to Stefan Fleischmann for figuring out the details!
but not across accelerators
– Improve portability
– Wide user base: let users use the hardware they have
– AMD OpenCL: ~200W vs 145W (-25% wrt 980)
– NVIDIA OpenCL: severely lacking
– Strong-scaling regime: this is where most of our efforts go!
– Benchmark “show-off” regime: this is where the “free lunch” from new hardware comes into full effect
[Figure: non-bonded kernel time (ms) vs. system size (0.96k-3072k atoms) for M2090, GTX 760, GTX TITAN, K20c @ 758 MHz, K40c @ 875 MHz, GTX 980, K80 @ 875 MHz, M6000 @ 1151 MHz, GTX TITAN X, and AMD R290X]
– x86: SSE2, SSE4.1, AVX (+FMA4), AVX2, AVX-512
– ARM: NEON, NEON-ASIMD
– IBM: QPX, VSX, VMX
– Sparc64
→ Facilitated by GROMACS' generic SIMD layer
evil
– challenges: increasing core/hardware thread count
Input: VSD ion channel embedded in a membrane, 47k atoms
Settings: PME, cut-off >= 1 nm, 5 fs, all bonds constrained
Hardware: Core i7 5960X & GeForce GTX 980

Performance (ns/day):
A: 1 thread                      1.65
B: 1 thread  + AVX2              8.3
C: 8 threads                    11.3
D: 8 threads + AVX2             54.8
E: 8 ranks   + AVX2 + GPU      145
F: 8 threads + AVX2 + GPU      195.6

[Figure: fraction of wall-time (%) per task for configurations A-F: neighbor search, launch GPU ops, force, PME mesh, wait for GPU, NB x/f buffer ops, virtual sites, update, constraints, rest]
Intel!
(Blue Waters, Blue Gene, x86 servers)
[Figure: fraction of wall-time per task (constraints, update, NB x/f buffer ops, wait for GPU, PME mesh, bonded force, launch GPU ops, neighbor search, domain decomposition); ADH, 134k atoms, PME, 2 fs, all bonds constrained; single-socket runs (Pow8 1x12 1T/C, Pow8 1x24 2T/C, IVB 1x10, HSW 1x18) and multi-rank runs (Pow8 8x6 2T/C, Pow8 8x12 4T/C, IVB 2x10, IVB 4x5, HSW 4x9, HSW 6x6)]
[Figure: performance (ns/day, ~18-46) for ADH 134k and Ethanol 180k on 2-socket nodes: IVB, HSW, and Power8, with 1 socket + 1 K40 and 2 sockets + 2 K40s]
→ implicit buffer
– allows an explicit buffer* of 0 in some cases
Páll, S., & Hess, B. (2013). Comput. Phys. Commun., 184(12), 2641–2650. http://doi.org/10.1016/j.cpc.2013.06.003
[Figure: implicit buffer vs. explicit buffer]
*Buffer size or list cut-off in GROMACS terminology
– search less often => larger buffer needed
– cost trade-off: pair search + domain decomposition vs. non-bonded work
[Figure: MD-step timeline; CPU (SIMD, OpenMP): pair search/DD, bonded F, PME, integration, constraints, GPU launch, wait for GPU; GPU (CUDA or OpenCL): H2D pair list, H2D x,q, non-bonded F & pair-list pruning, D2H F, clear F; idle otherwise]
More CPU-GPU overlap: higher GPU utilization
[Figure: rlist (buffered cut-off, 1.0-1.7 nm) vs. pair-list rebuild frequency (5-100 steps) for PME 2 fs h-bonds, PME 5 fs all-bonds, and RF 2 fs h-bonds]
No explicit buffer needed
The cost of less frequent list updates:
→ a larger buffer, hence increasing non-bonded cost
In practice:
→ optimal rebuild interval: 15-50 steps
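Schematically, the trade-off can be written as follows (a hedged summary; GROMACS derives the buffer from a target energy-drift tolerance rather than from this bound directly):

```latex
r_{\mathrm{list}} = r_c + b(n_{\mathrm{stlist}}), \qquad b \text{ increasing with } n_{\mathrm{stlist}}
```

The buffer b must be large enough that no pair beyond r_list at search time comes within r_c before the next rebuild, so rebuilding less often means a larger r_list and more non-bonded pairs to evaluate every step.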
[Figure: performance (ns/day, ~20-50) vs. neighbor-search frequency (20-120 steps); AQP with 10 MPI x 2 OpenMP, 4 MPI x 5 OpenMP, and 2 MPI x 10 OpenMP, plus RIB with 4 MPI x 5 OpenMP (ns/day ×10!)]
[Figure: DD initialization for the Kv2.1 Voltage Sensor Domain (VSD); ranks 0..n spread over nodes 0..M]
DD load balance for a few 1000 steps
– measure force load per rank
– shift cell boundaries every N steps
– fast & eager algorithm
– rdtsc cycle counters (see the sketch after this list); synchronization or an actual clock needed
– executed ~serialized => the lucky rank gets scheduled first
– use custom block scheduling?
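A minimal sketch of cycle-counter-based load measurement of the kind referred to above (illustrative only; GROMACS' cycle-counting infrastructure differs, and the helper shown is invented):

```cuda
#include <cstdint>
#include <x86intrin.h>   // __rdtsc() on x86; other platforms need a different counter

// Accumulate the cycles a rank spends in its force computation so the DD load
// balancer can compare ranks and shift cell boundaries toward less loaded ones.
struct ForceCycleCounter
{
    uint64_t start = 0;
    uint64_t total = 0;

    void begin() { start = __rdtsc(); }
    void end()   { total += __rdtsc() - start; }
};

// Usage per MD step (sketch; compute_forces_for_local_domain is hypothetical):
//   counter.begin();
//   compute_forces_for_local_domain();
//   counter.end();
// After N steps, counter.total is compared across ranks. Raw rdtsc values are
// only comparable if the clocks are synchronized (or an actual wall clock is
// used), which is the caveat noted on the slide.
```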
– #NUMA regions > #GPUs
– reducing multi-threaded bottlenecks
– overlap comm/kernel from different ranks sharing a GPU
[Figure: 4 ranks x 4 cores each, sharing a single GPU (ranks 0-3)]
The unlucky rank waits! But should we balance on this?
[Figure: wall-time/step used for DLB; CPU and GPU portions for ranks 0-3 sharing a GPU]
Consistently unlucky ranks would get overloaded!
not sharing
– Can't measure actual kernel times with multiple streams! (CUDA feature bug)
– Need to resort to the crude method: redistribute the wait
[Figure: 4 ranks x 4 cores each, sharing a single GPU (ranks 0-3)]
shifts work from long- to short-range
– MPMD: PP – PME ranks
– non-bonded offload: CPU – GPU
– twin cut-off kernel (3-5% slower)
– LJ-PME in v5.0 allows scaling both
[Figure: short-range (non-bonded) vs. long-range (PME) split; short-range cut-off 0.9 nm; increase cut-off → increase grid spacing, shifting work from PME to PP; LJ cut-off fixed unless LJ-PME is used]
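The scan can be summarized as scaling the direct-space cut-off and the PME grid spacing by a common factor (a schematic restatement of the slide; in practice the Ewald splitting parameter is re-derived so the requested accuracy is preserved):

```latex
r_c \;\rightarrow\; s\, r_c, \qquad h_{\mathrm{PME}} \;\rightarrow\; s\, h_{\mathrm{PME}}, \qquad s \ge 1
```

Increasing s enlarges the short-range pair volume (more non-bonded work on the PP ranks or the GPU) while coarsening the PME grid (less spread/gather and FFT work), which is how load is shifted from long- to short-range.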
DD load balance for a few 1000 steps
At the same time, scan the possible scaled cut-offs/PME grids
[Figure: at step 0 the DD grid is uniform and the scan allows rcoul → 1.7 rcoul; after DD load balancing for a few 1000 steps (by step 5000) the DD grid is non-uniform and staggered]
Scanning the possible increased cut-offs is limited by the new cell size!
[Figure: with the staggered DD grid only rcoul → 1.1 rcoul remains possible (the minimum cell size must be observed), instead of rcoul → 1.7 rcoul]
→ performance fluctuation
– Network/routing is bandwidth-optimized
– Need to use feature flags
– Increase the GNI eager buffer size
[Figure: strong scaling on Piz Daint: performance (ns/day, 1-1000) vs. #nodes (1-256) for ethanol 45k, ethanol 432k, and GLIC 2 fs]
– Latency-sensitive codes: stay here if you can
– Using SLURM? → get the feature flags implemented by your admins or Cray!
Piz Daint: Xeon E5-2670 2.6 GHz (SNB) Tesla K20X (w/o application clock support)
Heavily PME-bound regime: No Aries tweaks used!
– Can use it in parallel runs and
Hardware: Core i7 5960X, gcc 4.9, CUDA 6.5; GROMACS 5.1-dev

Performance (ns/day):
                        GTX TITAN   Quadro M6000
Aquaporin                  22.2        27.7
ion channel                24.2        28.3
ion channel, vsites        53.1        59.1
methanol 54k               77.3        88.5
Aggregate throughput (ns/day):

                                      villin 5fs   rnase dodec 5fs   GLIC 5fs
i7-3930K                                 356.2         119.0           10.9
i7-3930K + K20c                          890.4         346.8           34.2
i7-3930K + Tesla K20c, 6 repl           1091.6         373.2           30.2
i7-5960X                                 512.4         203.2           19.5
i7-5960X + M6000                        1382.1         535.5           59.3
i7-5960X + Quadro M6000, 6 repl         1921.3         701.7           65.6
Opteron 6376                             183.3          63.7            8.3
Opteron 6376 + GTX TITAN                 539.7         225.8           23.6
Opteron 6376 + GTX TITAN, 6 repl        1430.8         390.4           38.1
Quadro M6000 (TITAN X): up to 25% extra total performance
[Figure: performance (ns/day) for 150k-, 16k-, and 6k-atom systems; 3225 FPS !!!]
[Figure: performance (ns/day) for the MEM and RIB benchmark systems]
Experiments done by Carsten Kutzner, MPI
– strong benefits (enabling all users)
– challenges (scaling is less pretty)
– efforts required on all levels (and it will get harder)
– more offload, more load balancing, more automation needed
→ shifting into evolution mode
– Where is the big Denver chip?
– Integration, integration, integration (I hope AMD APUs become a model)
– Jiri Kraus, Mark Berger – and the many many engineers
Vincent Hindriksen, Anca Hamuraru, Teemu Virolainen
VMD rendering: Viveca Lindahl