Tackling Performance Bottlenecks in the Diversifying CUDA HPC Ecosystem: a Molecular Dynamics Perspective
Szilárd Páll
KTH Royal Institute of Technology
GTC 2015

Diversifying hardware & complex parallelism
– parallel on multiple levels
– heterogeneous
– power constrained
[Figure: the diversifying hardware landscape]
– increasing CPU/GPU core count; NUMA on die
– widening SIMD/SIMT units
– “skinny” workstations to fat compute nodes: NUMA, ...
– mini-clusters, petaflop machines & on-demand computing (compute cloud)
Reproduced under CC BY-SA from: http://commons.wikimedia.org/wiki/File:MM_PEF.png
– N particles, masses, potential V
– forces → accelerations → velocities
– velocities → positions
Newton's equations
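For reference (standard MD background rather than slide content), the equations integrated at each step, written here in a leapfrog form such as GROMACS uses by default:

```latex
% Newton's equations for particle i with mass m_i:
m_i \frac{d^2 \mathbf{r}_i}{dt^2} = \mathbf{F}_i = -\nabla_{\mathbf{r}_i} V(\mathbf{r}_1, \dots, \mathbf{r}_N)

% Leapfrog update with time step \Delta t:
\mathbf{v}_i\!\left(t + \tfrac{\Delta t}{2}\right) = \mathbf{v}_i\!\left(t - \tfrac{\Delta t}{2}\right) + \frac{\mathbf{F}_i(t)}{m_i}\,\Delta t, \qquad
\mathbf{r}_i(t + \Delta t) = \mathbf{r}_i(t) + \mathbf{v}_i\!\left(t + \tfrac{\Delta t}{2}\right)\Delta t
```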
Bonded interactions:
– few but localized
– imbalance challenge (threading & domain decomposition)
Non-bonded interactions: computed over all atom pairs!
– too expensive → limit the interaction range (cut-off)
– van der Waals: can use a cut-off
– electrostatics: a cut-off alone is not good enough → PME
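For context (standard force-field background, not taken from the slides), the two non-bonded pair terms behind this distinction are:

```latex
V_{ij}(r_{ij}) =
\underbrace{4\varepsilon_{ij}\!\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]}_{\text{Lennard-Jones: decays fast, a cut-off works}}
\;+\;
\underbrace{\frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}}}_{\text{Coulomb: } \sim 1/r\text{, needs PME beyond the cut-off}}
```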
https://gerrit.gromacs.org
– 10k's of users in academia & industry
– 100k's through Folding@home
[Figure: eighth-shell domain decomposition; triclinic unit cell with load balancing and staggered cell boundaries; arbitrary unit cells, parallel constraints, virtual interaction sites]
– C++98 (subset) – CMake
– LOC: ~2 mil. ½ of which is SIMD!
[Figure: force-computation breakdown]
– Non-bonded: pair-search distance check; LJ + Coulomb tabulated (F, F+E); Coulomb tabulated (F, F+E); 1-4 nonbonded interactions
– PME: calc weights, spread Q B-spline, gather F B-spline, 3D-FFT
– Bonded & other: angles, propers, SETTLE
[Figure: anatomy of an MD step: non-bonded F, bonded F, PME, integration, constraints; a pair-search step runs every 10-50 iterations; one MD iteration = one step, taking ~ milliseconds or less]
since GROMACS v4.6
– offload-based: 2-4x speedup
– multi-GPU “for free”
– wide feature support
but:
– added latencies, overheads
– load balancing
[Figure: CPU-GPU offload timeline for a “regular” MD step]
– CPU: pair search/domain decomposition every 10-50 iterations; bonded F, PME, integration, constraints; launches the GPU work, then waits for the GPU
– GPU: H2D x,q → non-bonded F (& pair-list pruning) → D2H F → clear F; idle for the rest of the step
– peak step times: 100s of microseconds!
– average CPU-GPU overlap: 60-80% per step
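The offload pattern in this figure can be sketched with a single asynchronous CUDA stream per step. The following is a minimal, hedged illustration; the kernel, its trivial body, and the buffer names are placeholders, not GROMACS code:

```cuda
#include <cuda_runtime.h>

// Illustrative stand-in for the short-range non-bonded force kernel.
__global__ void nonbonded_force_kernel(const float4 *xq, float3 *f, int natoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < natoms)
    {
        // ... pair-list traversal and force accumulation would go here ...
        f[i] = make_float3(0.f, 0.f, 0.f);
    }
}

// One offloaded MD step. Assumes d_xq/d_f were allocated with cudaMalloc and
// h_xq/h_f are pinned (cudaHostAlloc) so the copies can overlap with CPU work.
void do_md_step(cudaStream_t s, const float4 *h_xq, float4 *d_xq,
                float3 *h_f, float3 *d_f, int natoms)
{
    // H2D: coordinates and charges for this step
    cudaMemcpyAsync(d_xq, h_xq, natoms * sizeof(*d_xq), cudaMemcpyHostToDevice, s);

    // Launch the non-bonded force kernel
    int block = 128, grid = (natoms + block - 1) / block;
    nonbonded_force_kernel<<<grid, block, 0, s>>>(d_xq, d_f, natoms);

    // D2H: forces back, then clear the device force buffer for the next step
    cudaMemcpyAsync(h_f, d_f, natoms * sizeof(*d_f), cudaMemcpyDeviceToHost, s);
    cudaMemsetAsync(d_f, 0, natoms * sizeof(*d_f), s);

    // ... the CPU computes bonded forces and PME here, overlapping the GPU ...

    // Wait for the GPU before combining forces, integrating and constraining
    cudaStreamSynchronize(s);
}
```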
since GROMACS v4.6
– offload-based: 3-4x speedup
– multi-GPU “for free”
– wide feature support
but:
– added latency, re-casting work for GPUs
– load balancing: intra-GPU, intra-node, ...
[Figure: CPU-GPU timeline with domain decomposition, using separate local and non-local CUDA streams]
– CPU (OpenMP threads): local & non-local pair search every 10-50 iterations; bonded F, PME, integration, constraints; MPI receive of non-local x and MPI send of non-local F; waits for the non-local and then the local GPU forces
– GPU, local stream: H2D local pair-list, H2D local x,q, local non-bonded F & list pruning, D2H local F, clear F
– GPU, non-local stream: H2D non-local pair-list, H2D non-local x,q, non-local non-bonded F & list pruning, D2H non-local F
– the GPU is idle during the pair-search/domain-decomposition step; a “regular” MD step follows
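A minimal sketch of the two-stream arrangement (again an illustrative pseudostructure with made-up names, not the GROMACS implementation; it assumes pinned host buffers and non-empty local and non-local atom ranges):

```cuda
#include <cuda_runtime.h>

// Illustrative stand-in for the non-bonded kernel on a range of atoms.
__global__ void nb_kernel(const float4 *xq, float3 *f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        f[i] = make_float3(0.f, 0.f, 0.f);  // real kernel: cluster-pair force loop
    }
}

// One DD step with separate local and non-local streams. Local atoms occupy
// [0, n_local), halo (non-local) atoms [n_local, n_total).
void step_with_dd(cudaStream_t loc, cudaStream_t nonloc,
                  const float4 *h_xq, float4 *d_xq, float3 *h_f, float3 *d_f,
                  int n_local, int n_total)
{
    int n_nl = n_total - n_local;

    // Local coordinates are ready first: copy and launch on the local stream.
    cudaMemcpyAsync(d_xq, h_xq, n_local * sizeof(*d_xq), cudaMemcpyHostToDevice, loc);
    nb_kernel<<<(n_local + 127) / 128, 128, 0, loc>>>(d_xq, d_f, n_local);

    // ... meanwhile the CPU receives the non-local (halo) x via MPI ...

    // Non-local work goes to its own stream so it can overlap with local work.
    cudaMemcpyAsync(d_xq + n_local, h_xq + n_local, n_nl * sizeof(*d_xq),
                    cudaMemcpyHostToDevice, nonloc);
    nb_kernel<<<(n_nl + 127) / 128, 128, 0, nonloc>>>(d_xq + n_local, d_f + n_local, n_nl);

    // Non-local forces come back first so the MPI send of halo forces can start
    // as early as possible; local forces follow on the local stream.
    cudaMemcpyAsync(h_f + n_local, d_f + n_local, n_nl * sizeof(*d_f),
                    cudaMemcpyDeviceToHost, nonloc);
    cudaMemcpyAsync(h_f, d_f, n_local * sizeof(*d_f), cudaMemcpyDeviceToHost, loc);

    cudaStreamSynchronize(nonloc);  // ... then MPI send of non-local F ...
    cudaStreamSynchronize(loc);     // ... then reduce forces, integrate, constrain ...
}
```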
– lends itself well to efficient fine-grained parallelization
– adaptable to the characteristics of the architecture
[Figure: pair-interaction setups: classical 1x1 neighborlist on 4-way SIMD, the 4x4 cluster setup on 4-way SIMD, and the 4x4 setup on SIMT]
– data reuse – arithmetic
– cache-efficiency
– Traditional 1x1 algorithm: cache pressure, poor data reuse, in-register shuffle bound
– Cluster algorithm for SSE4, VMX, 128-bit AVX: cache friendly, 4-way j-reuse
– Cluster algorithm for fine-grained hardware threading: SIMT, etc.
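To make the 4x4-on-SIMT idea concrete, here is a deliberately simplified CUDA sketch in which each thread owns one atom of a 4-atom i-cluster and the block iterates over that cluster's j-list. It only illustrates the scheme; the real GROMACS kernels use larger clusters, super-clusters, exclusion masks, and many further optimizations:

```cuda
#include <cuda_runtime.h>

#define CLUSTER_SIZE 4  // atoms per cluster in this simplified illustration

// Launch with <<<numIClusters, CLUSTER_SIZE>>>: one block per i-cluster, one
// thread per i-atom. cj_list holds j-cluster indices, flattened; cj_start and
// cj_end delimit each i-cluster's sub-list.
__global__ void cluster_pair_kernel(const float4 *xq, float3 *f,
                                    const int *cj_list, const int *cj_start,
                                    const int *cj_end, float rcut2)
{
    int ci = blockIdx.x;                       // this block's i-cluster
    int ai = ci * CLUSTER_SIZE + threadIdx.x;  // this thread's i-atom
    float4 xqi = xq[ai];
    float3 fi  = make_float3(0.f, 0.f, 0.f);

    __shared__ float4 xqj[CLUSTER_SIZE];

    for (int idx = cj_start[ci]; idx < cj_end[ci]; idx++)
    {
        // Load the whole j-cluster once: 4-way j-data reuse across the i-atoms.
        xqj[threadIdx.x] = xq[cj_list[idx] * CLUSTER_SIZE + threadIdx.x];
        __syncthreads();

        // All-vs-all within the 4x4 cluster pair; pairs beyond the cut-off are
        // the extra arithmetic traded for regular, cache-friendly access.
        for (int j = 0; j < CLUSTER_SIZE; j++)
        {
            float dx = xqi.x - xqj[j].x, dy = xqi.y - xqj[j].y, dz = xqi.z - xqj[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < rcut2 && r2 > 0.f)
            {
                float rinv = rsqrtf(r2);
                float s    = xqi.w * xqj[j].w * rinv * rinv * rinv;  // toy Coulomb
                fi.x += s * dx;
                fi.y += s * dy;
                fi.z += s * dz;
            }
        }
        __syncthreads();
    }
    f[ai] = fi;  // each i-atom belongs to exactly one i-cluster: no atomics needed
}
```

Loading each j-cluster once into shared memory is what provides the j-reuse and the regular memory access the slide refers to.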
– create enough independent work units (= blocks) per SM[X|M]
– sort them
– 2-3x on “narrower” GPUs
– 3-5x on “wider” GPUs
– lowers j-particle data reuse – atomic clashes
[Figure: workload per SMX (Tesla K20c, 1500 atoms): per-SMX execution time (kcycles) and pair-list sizes for the raw pair list vs. the reshaped list]
– Raw lists: too few blocks → imbalanced
– Regularized lists: balanced SMX execution
– Regularized: 4x faster execution
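One way to picture the list regularization is splitting each i-cluster's long j-list into bounded chunks so that the number of thread blocks far exceeds the SM count. A hedged host-side sketch (the data structures are invented for illustration and are not GROMACS'):

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustration only: split each i-cluster's j-list into chunks of at most
// max_cj_per_block j-clusters, so the kernel gets many similarly sized blocks.
struct PairListEntry
{
    int ci;               // i-cluster index
    std::vector<int> cj;  // j-cluster indices interacting with ci
};

std::vector<PairListEntry> regularize(const std::vector<PairListEntry> &raw,
                                      std::size_t max_cj_per_block)
{
    std::vector<PairListEntry> out;
    for (const PairListEntry &e : raw)
    {
        for (std::size_t start = 0; start < e.cj.size(); start += max_cj_per_block)
        {
            std::size_t end = std::min(e.cj.size(), start + max_cj_per_block);
            out.push_back({e.ci, std::vector<int>(e.cj.begin() + start,
                                                  e.cj.begin() + end)});
        }
    }
    // Each entry becomes one thread block: several blocks may now share an
    // i-cluster, so force accumulation needs atomics and j-data reuse drops,
    // which are the trade-offs noted above.
    return out;
}
```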
[Figure: Tesla K20c force-kernel tuning: kernel time (0.05-0.5 ms) while scanning list splits from 0 → 1000, for systems of 960, 1.5k, 3k, 6k, and 12k atoms]
– better inter-SM load balancing
– more consistent list lengths
– concurrent atomic operations

[Figure: CUDA non-bonded force kernel performance history: step time per 1000 atoms (ms) vs. system size (0.96k-768k atoms) on a Tesla C2070; PME, rc = 1.0 nm, nstlist = 20; curves: initial target, first alpha, second alpha, pair-list tweaks, Kepler back-port, improved sorting, CUDA 5.5 + buffer improvement, GMX 5.0, reduction tweak; a second panel additionally shows the GTX TITAN X]
[Figure: particle-cluster processing (i- and j-loops) on GF/GK1xx/GM vs. on GK210; 4x4 setup on SIMT]
GPU, #threads x #blocks   kernel time (ms)
K80   64x16                    3.95
K80  128x16                    2.90
K80  128x15                    2.91
K80  128x14                    3.08
K80  256x8                     3.06
K40   64x16                    3.41
K40  128x8                     3.57
K40  256x4                     5.50
– First two generations of Kepler: moderate boost headroom (Tesla K20 7.5%, K40 17.5%)
– Tesla K80: aggressive boost, 562 MHz → 875 MHz (+55.7%)
→ a load-balance concern
– adjust clocks at run-time or warn the user if application clocks are set below the maximum
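A hedged sketch of such a run-time check using NVML (it mirrors the idea rather than GROMACS' actual implementation; error handling is minimal):

```cuda
#include <cstdio>
#include <nvml.h>   // link with -lnvidia-ml

// Warn if the SM application clock is set below the maximum supported clock;
// with sufficient permissions one could instead raise it with
// nvmlDeviceSetApplicationsClocks(). Illustrative sketch only.
void check_application_clocks(unsigned int gpu_index)
{
    if (nvmlInit() != NVML_SUCCESS)
    {
        return;
    }

    nvmlDevice_t dev;
    unsigned int appClock = 0, maxClock = 0;
    if (nvmlDeviceGetHandleByIndex(gpu_index, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetApplicationsClock(dev, NVML_CLOCK_SM, &appClock) == NVML_SUCCESS &&
        nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_SM, &maxClock) == NVML_SUCCESS &&
        appClock < maxClock)
    {
        fprintf(stderr,
                "NOTE: GPU %u application clock is %u MHz, below the maximum of "
                "%u MHz; consider raising it (e.g. with nvidia-smi -ac).\n",
                gpu_index, appClock, maxClock);
    }
    nvmlShutdown();
}
```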
+------------------------------------------------------+
| NVIDIA-SMI 346.47     Driver Version: 346.47         |
|-------------------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage |
|===============================+======================+
|   0  GeForce GTX TITAN    On  | 0000:02:00.0     Off |
| 58%   80C    P0   191W / 250W |    107MiB /  6143MiB |
+-------------------------------+----------------------+
|   1  Quadro M6000         On  | 0000:03:00.0     Off |
| 54%   83C    P0   199W / 250W |    533MiB / 12287MiB |
+-------------------------------+----------------------+
power consumption asymmetry → performance asymmetry
– fan speed capped by default to <60% – throttle-prone: temperature limit (~80C)
[Figure: K80 power (W), temperature (°C), and clock (MHz/10) for chip 0 and chip 1 over ~500 samples taken every 2 s]
[Figure: temperature (°C) and clock frequency (MHz) over time (samples every 2 s) for a GTX TITAN, and for a Quadro M6000 at application clocks of 1151 MHz, 987 MHz, and 1113 MHz with boost disabled]
The Quadro M6000 throttles regardless of application clocks (AC) if auto-boost is on.
The GTX TITAN can throttle by 10-20% even in a well-cooled chassis (non-OC cards).
K80 Chip1: hotter, slower, more power-hungry.
xorg.conf:

Section "ServerLayout"
    Identifier "dual"
    Screen 0 "Screen0"
    Screen 1 "Screen1" RightOf "Screen0"
EndSection

Section "Device"
    Identifier "nvidia0"
    Driver     "nvidia"
    VendorName "NVIDIA"
    BoardName  "GeForce GTX TITAN"
    Option     "UseDisplayDevice" "none"
    Option     "Coolbits" "4"
    BusID      "PCI:2:0:0"
EndSection

Section "Device"
    [...]
EndSection

Section "Screen"
    Identifier "Screen0"
    Device     "nvidia0"
EndSection

Section "Screen"
    [...]
EndSection
Start the X server:

export DISPLAY=:0
/usr/bin/X $DISPLAY -nolisten tcp vt7 -novtswitch &
nvidia-settings -a [gpu:0]/GPUFanControlState=1 -a [fan:0]/GPUCurrentFanSpeed=80
nvidia-settings -a [gpu:1]/GPUFanControlState=1 -a [fan:1]/GPUCurrentFanSpeed=80

+------------------------------------------------------+
| NVIDIA-SMI 346.47     Driver Version: 346.47         |
|-------------------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage |
|===============================+======================+
|   0  GeForce GTX TITAN    On  | 0000:02:00.0     Off |
| 80%   69C    P8   205W / 250W |    549MiB /  6143MiB |
+-------------------------------+----------------------+
|   1  Quadro M6000         On  | 0000:03:00.0     Off |
| 80%   66C    P0   203W / 250W |    533MiB / 12287MiB |
+-------------------------------+----------------------+
Thanks to Stefan Fleischmann for figuring out the details!
but not across accelerators
– Improve portability
– Wide user base: let users use the hardware they have
– AMD OpenCL: ~200W vs 145W (-25% wrt 980)
– NVIDIA OpenCL: severely lacking
– Strong-scaling regime: this is where most of our efforts go!
– Benchmark “show-off” regime: this is where the “free lunch” from new hardware comes into full effect
[Figure: non-bonded kernel time (ms) vs. system size (0.96k-3072k atoms) for M2090, GTX 760, GTX TITAN, K20c @ 758 MHz, K40c @ 875 MHz, GTX 980, K80 @ 875 MHz, M6000 @ 1151 MHz, GTX TITAN X, and AMD R290X]
– x86: SSE2, SSE4.1, AVX (+FMA4), AVX2, AVX-512
– ARM: NEON, NEON-ASIMD
– IBM: QPX, VSX, VMX
– Sparc64
→ Facilitated by GROMACS' generic SIMD layer
evil
– challenges: increasing core/hardware thread count
Input: VSD ion channel embedded in a membrane, 47k atoms
Settings: PME, cut-off >= 1 nm, 5 fs, all bonds constrained
Hardware: Core i7 5960X & GeForce GTX 980

Performance (ns/day):
A: 1 thread                      1.65
B: 1 thread  + AVX2              8.3
C: 8 threads                    11.3
D: 8 threads + AVX2             54.8
E: 8 ranks   + AVX2 + GPU      145
F: 8 threads + AVX2 + GPU      195.6

[Figure: fraction of wall-time (%) per task for configurations A-F: neighbor search, launch GPU ops, force, PME mesh, wait for GPU, NB x/f buffer ops, virtual sites, update, constraints, rest]
Intel!
(Blue Waters, Blue Gene, x86 servers)
[Figure: fraction of wall-time per task (constraints, update, NB x/f buffer ops, wait for GPU, PME mesh, bonded force, launch GPU ops, neighbor search, domain decomposition); ADH, 134k atoms, PME, 2 fs, all bonds constrained; single-socket runs (Pow8 1x12 1T/C, Pow8 1x24 2T/C, IVB 1x10, HSW 1x18) and multi-rank runs (Pow8 8x6 2T/C, Pow8 8x12 4T/C, IVB 2x10, IVB 4x5, HSW 4x9, HSW 6x6)]
[Figure: performance (ns/day, ~18-46) for ADH 134k and Ethanol 180k on 2-socket nodes: IVB, HSW, and Power8, with 1 socket + 1 K40 and 2 sockets + 2 K40s]
→ implicit buffer
– allows an explicit buffer* of 0 in some cases
Páll, S., & Hess, B. (2013). Comput. Phys. Commun., 184(12), 2641–2650. http://doi.org/10.1016/j.cpc.2013.06.003
[Figure: implicit buffer vs. explicit buffer]
*Buffer size or list cut-off in GROMACS terminology
– search less often => larger buffer needed
– cost trade-off: pair search + domain decomposition vs. non-bonded work
[Figure: MD-step timeline; CPU (SIMD, OpenMP): pair search/DD, bonded F, PME, integration, constraints, GPU launch, wait for GPU; GPU (CUDA or OpenCL): H2D pair list, H2D x,q, non-bonded F & pair-list pruning, D2H F, clear F; idle otherwise]
More CPU-GPU overlap: higher GPU utilization
[Figure: rlist (buffered cut-off, 1.0-1.7 nm) vs. pair-list rebuild frequency (5-100 steps) for PME 2 fs h-bonds, PME 5 fs all-bonds, and RF 2 fs h-bonds]
No explicit buffer needed
The cost of less frequent list updates:
→ a larger buffer, hence increasing non-bonded cost
In practice:
→ optimal rebuild interval: 15-50 steps
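Schematically, the trade-off can be written as follows (a hedged summary; GROMACS derives the buffer from a target energy-drift tolerance rather than from this bound directly):

```latex
r_{\mathrm{list}} = r_c + b(n_{\mathrm{stlist}}), \qquad b \text{ increasing with } n_{\mathrm{stlist}}
```

The buffer b must be large enough that no pair beyond r_list at search time comes within r_c before the next rebuild, so rebuilding less often means a larger r_list and more non-bonded pairs to evaluate every step.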
[Figure: performance (ns/day, ~20-50) vs. neighbor-search frequency (20-120 steps); AQP with 10 MPI x 2 OpenMP, 4 MPI x 5 OpenMP, and 2 MPI x 10 OpenMP, plus RIB with 4 MPI x 5 OpenMP (ns/day ×10!)]
[Figure: DD initialization for the Kv2.1 Voltage Sensor Domain (VSD); ranks 0..n spread over nodes 0..M]
DD load balance for a few 1000 steps
– measure force load per rank
– shift cell boundaries every N steps
– fast & eager algorithm
– rdtsc cycle counters (see the sketch after this list); synchronization or an actual clock needed
– executed ~serialized => the lucky rank gets scheduled first
– use custom block scheduling?
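A minimal sketch of cycle-counter-based load measurement of the kind referred to above (illustrative only; GROMACS' cycle-counting infrastructure differs, and the helper shown is invented):

```cuda
#include <cstdint>
#include <x86intrin.h>   // __rdtsc() on x86; other platforms need a different counter

// Accumulate the cycles a rank spends in its force computation so the DD load
// balancer can compare ranks and shift cell boundaries toward less loaded ones.
struct ForceCycleCounter
{
    uint64_t start = 0;
    uint64_t total = 0;

    void begin() { start = __rdtsc(); }
    void end()   { total += __rdtsc() - start; }
};

// Usage per MD step (sketch; compute_forces_for_local_domain is hypothetical):
//   counter.begin();
//   compute_forces_for_local_domain();
//   counter.end();
// After N steps, counter.total is compared across ranks. Raw rdtsc values are
// only comparable if the clocks are synchronized (or an actual wall clock is
// used), which is the caveat noted on the slide.
```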
– #NUMA regions > #GPUs
– reducing multi-threaded bottlenecks
– overlap comm/kernel from different ranks sharing a GPU
[Figure: 4 ranks x 4 cores each, sharing a single GPU (ranks 0-3)]
The unlucky rank waits! But should we balance on this?
[Figure: wall-time/step used for DLB; CPU and GPU portions for ranks 0-3 sharing a GPU]
Consistently unlucky ranks would get overloaded!
not sharing
– Can't measure actual kernel times with multiple streams! (CUDA feature bug)
– Need to resort to the crude method: redistribute the wait
[Figure: 4 ranks x 4 cores each, sharing a single GPU (ranks 0-3)]
shifts work from long- to short-range
– MPMD: PP – PME ranks
– non-bonded offload: CPU – GPU
– twin cut-off kernel (3-5% slower)
– LJ-PME in v5.0 allows scaling both
[Figure: short-range (non-bonded) vs. long-range (PME) split; short-range cut-off 0.9 nm; increase cut-off → increase grid spacing, shifting work from PME to PP; LJ cut-off fixed unless LJ-PME is used]
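The scan can be summarized as scaling the direct-space cut-off and the PME grid spacing by a common factor (a schematic restatement of the slide; in practice the Ewald splitting parameter is re-derived so the requested accuracy is preserved):

```latex
r_c \;\rightarrow\; s\, r_c, \qquad h_{\mathrm{PME}} \;\rightarrow\; s\, h_{\mathrm{PME}}, \qquad s \ge 1
```

Increasing s enlarges the short-range pair volume (more non-bonded work on the PP ranks or the GPU) while coarsening the PME grid (less spread/gather and FFT work), which is how load is shifted from long- to short-range.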
DD load balance for a few 1000 steps
At the same time, scan the possible scaled cut-offs/PME grids
[Figure: at step 0 the DD grid is uniform and the scan allows rcoul → 1.7 rcoul; after DD load balancing for a few 1000 steps (by step 5000) the DD grid is non-uniform and staggered]
Scanning the possible increased cut-offs is limited by the new cell size!
[Figure: with the staggered DD grid only rcoul → 1.1 rcoul remains possible (the minimum cell size must be observed), instead of rcoul → 1.7 rcoul]
→ performance fluctuation
– Network/routing is bandwidth-optimized
– Need to use feature flags
– Increase the GNI eager buffer size
[Figure: strong scaling on Piz Daint: performance (ns/day, 1-1000) vs. #nodes (1-256) for ethanol 45k, ethanol 432k, and GLIC 2 fs]
– Latency-sensitive codes: stay here if you can
– Using SLURM? → get the feature flags implemented by your admins or Cray!
Piz Daint: Xeon E5-2670 2.6 GHz (SNB) Tesla K20X (w/o application clock support)
Heavily PME-bound regime: No Aries tweaks used!
– Can use it in parallel runs and
Hardware: Core i7 5960X, gcc 4.9, CUDA 6.5; GROMACS 5.1-dev

Performance (ns/day):
                        GTX TITAN   Quadro M6000
Aquaporin                  22.2        27.7
ion channel                24.2        28.3
ion channel, vsites        53.1        59.1
methanol 54k               77.3        88.5
Aggregate throughput (ns/day):

                                      villin 5fs   rnase dodec 5fs   GLIC 5fs
i7-3930K                                 356.2         119.0           10.9
i7-3930K + K20c                          890.4         346.8           34.2
i7-3930K + Tesla K20c, 6 repl           1091.6         373.2           30.2
i7-5960X                                 512.4         203.2           19.5
i7-5960X + M6000                        1382.1         535.5           59.3
i7-5960X + Quadro M6000, 6 repl         1921.3         701.7           65.6
Opteron 6376                             183.3          63.7            8.3
Opteron 6376 + GTX TITAN                 539.7         225.8           23.6
Opteron 6376 + GTX TITAN, 6 repl        1430.8         390.4           38.1
Quadro M6000 (TITAN X): up to 25% extra total performance
[Figure: performance (ns/day) for 150k-, 16k-, and 6k-atom systems; 3225 FPS !!!]
[Figure: performance (ns/day) for the MEM and RIB benchmark systems]
Experiments done by Carsten Kutzner, MPI
– strong benefits (enabling all users)
– challenges (scaling is less pretty)
– efforts required on all levels (and it will get harder)
– more offload, more load balancing, more automation needed
→ shifting into evolution mode
– Where is the big Denver chip?
– Integration, integration, integration (I hope AMD APUs become a model)
– Jiri Kraus, Mark Berger – and the many many engineers
Vincent Hindriksen, Anca Hamuraru, Teemu Virolainen
VMD rendering: Viveca Lindahl