Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?

SLIDE 1

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?

GTC 2018 March 28, 2018

Olga Pearce (Lawrence Livermore National Laboratory) http://people.llnl.gov/olga Max Katz (NVIDIA), Leopold Grinberg (IBM)

LLNL-PRES-746880 Slide 1

SLIDE 2

Multi-Process Service (MPS)

Allows kernels launched from different (MPI) processes to be processed concurrently on the same GPU

◮ Utilize inactive SMs when the work is small

[Figure: GPU schedule over time and SMs; the GPU is shared 'in space']


SLIDE 3

Multi-Process Service (MPS)

Allows kernels launched from different (MPI) processes to be processed concurrently on the same GPU

◮ Utilize inactive SMs when the work is small
◮ Processes take turns if every SM is occupied

[Figure: GPU schedules over time and SMs; the GPU is shared 'in space' and 'in time']
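MPS is enabled by starting the MPS control daemon (nvidia-cuda-mps-control -d) on each node before the MPI job launches. A minimal sketch of the situation MPS addresses, assuming a hypothetical saxpy kernel and several MPI ranks sharing one GPU (not code from the slides):

// Each MPI rank on the node submits a small kernel to the same GPU.
// Without MPS, work from different processes is time-sliced on the device;
// with the MPS daemon running, the kernels can run concurrently and fill
// otherwise idle SMs. 'saxpy' is an illustrative kernel, not an ARES kernel.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void saxpy(int n, double a, const double* x, double* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] += a * x[i];
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  cudaSetDevice(0);                      // every rank on this node uses GPU 0
  const int n = 1 << 18;                 // small: far fewer threads than the GPU can hold
  double *x, *y;
  cudaMalloc(&x, n * sizeof(double));
  cudaMalloc(&y, n * sizeof(double));
  cudaMemset(x, 0, n * sizeof(double));
  cudaMemset(y, 0, n * sizeof(double));

  saxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
  cudaDeviceSynchronize();               // each rank waits only for its own kernel

  cudaFree(x); cudaFree(y);
  MPI_Finalize();
  return 0;
}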


SLIDE 4

Sierra system architecture finalized and currently under deployment at LLNL

Compute System:
◮ 4,320 nodes
◮ 1.29 PB memory
◮ 240 compute racks
◮ 125 PFLOPS
◮ ≈ 12 MW


SLIDE 5

Sierra system architecture finalized and currently under deployment at LLNL

Compute System:
◮ 4,320 nodes
◮ 1.29 PB memory
◮ 240 compute racks
◮ 125 PFLOPS
◮ ≈ 12 MW

Compute Node:
◮ 2 IBM POWER9 CPUs
◮ 4 NVIDIA Volta GPUs
◮ 256 GiB DDR4
◮ NVMe-compatible PCIe 1.6 TB SSD
◮ 16 GiB globally addressable HBM2 associated with each GPU
◮ Coherent shared memory


SLIDE 6

Sierra system architecture finalized and currently under deployment at LLNL

Compute System:
◮ 4,320 nodes
◮ 1.29 PB memory
◮ 240 compute racks
◮ 125 PFLOPS
◮ ≈ 12 MW

Compute Node:
◮ 2 IBM POWER9 CPUs
◮ 4 NVIDIA Volta GPUs
◮ 256 GiB DDR4
◮ NVMe-compatible PCIe 1.6 TB SSD
◮ 16 GiB globally addressable HBM2 associated with each GPU
◮ Coherent shared memory
◮ CPUs: ≈ 5% of FLOPS; GPUs: ≈ 95% of FLOPS


SLIDE 7

Ways to utilize a node of Sierra (showing one socket)

[Figure: one socket of a node: one CPU and two GPUs]


SLIDE 8

Ways to utilize a node of Sierra (showing one socket)

◮ MPI process/core

[Figure: rank placement on one socket (CPU and two GPUs)]


SLIDE 9

Ways to utilize a node of Sierra (showing one socket)

◮ MPI process/core
◮ MPI process/GPU

[Figure: rank placement for each configuration on one socket (CPU and two GPUs)]


SLIDE 10

Ways to utilize a node of Sierra (showing one socket)

◮ MPI process/core
◮ MPI process/GPU
◮ MPI process/core, MPS for GPU (see the sketch below)

[Figure: rank placement for each configuration on one socket (CPU and two GPUs)]
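A minimal sketch, not from the slides, of how ranks can be bound to GPUs in the "MPI process/core, MPS for GPU" configuration; the node-local communicator and the ranks-per-GPU arithmetic are illustrative assumptions:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Ranks that share this node, so we can compute a node-local rank.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank, local_size;
  MPI_Comm_rank(node_comm, &local_rank);
  MPI_Comm_size(node_comm, &local_size);

  int num_gpus = 0;
  cudaGetDeviceCount(&num_gpus);                     // e.g., 4 GPUs per node

  // With one rank per core and MPS, several consecutive ranks share a GPU.
  int ranks_per_gpu = (local_size + num_gpus - 1) / num_gpus;
  cudaSetDevice(local_rank / ranks_per_gpu);

  // ... application work: the ranks mapped to one GPU submit kernels through MPS ...

  MPI_Finalize();
  return 0;
}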



SLIDE 12

Parallel performance of multiphysics simulations

Decide how to run each phase of the multiphysics simulation

◮ On GPU vs. on CPU
◮ How many MPI processes (one per CPU core or one per GPU)?
◮ If some phases use one MPI process per CPU core, can we use MPS for the accelerated phases?


SLIDE 13

Parallel performance of multiphysics simulations

Decide how to run each phase of the multiphysics simulation

◮ On GPU vs. on CPU
◮ How many MPI processes (one per CPU core or one per GPU)?
◮ If some phases use one MPI process per CPU core, can we use MPS for the accelerated phases?

Outline:
◮ Tools used for measurement
◮ Multiphysics application
◮ How the application is accelerated
◮ Results: MPI process/GPU vs. 4 MPI processes/GPU + MPS
  ◮ Impact on kernel performance
  ◮ Impact on communication


SLIDE 14

Tool: Caliper [SC’16] https://github.com/LLNL/Caliper

◮ Performance analysis toolbox, leverages existing tools
◮ Developed at LLNL
◮ Caliper team is responsive to our needs

1. Annotate: begin/end API similar to timer libraries (see the sketch after this list)
   ◮ Annotations of libraries (e.g., SAMRAI, hypre) combine seamlessly
2. Collect: runtime parameters instruct Caliper what to measure:
   ◮ MPI function calls
   ◮ Linux perf_event sampling (libpfm)
   ◮ CUDA driver/runtime calls (using CUPTI)
3. Analyze
   ◮ Using JSON output format
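A minimal sketch of the begin/end annotation style, assuming Caliper's CALI_MARK_BEGIN/CALI_MARK_END and CALI_CXX_MARK_FUNCTION macros from <caliper/cali.h>; the phase names are illustrative, not the actual ARES annotations:

#include <caliper/cali.h>          // Caliper annotation macros

void time_step() {
  CALI_CXX_MARK_FUNCTION;          // mark this whole function as a Caliper region

  CALI_MARK_BEGIN("hydro");        // begin/end API, similar to a timer library
  // ... hydrodynamics kernels ...
  CALI_MARK_END("hydro");

  CALI_MARK_BEGIN("halo_exchange");
  // ... MPI communication ...
  CALI_MARK_END("halo_exchange");
}

What gets recorded for these regions (MPI calls, perf_event samples, CUPTI events) is selected through Caliper's runtime configuration, and the results can be written as JSON for analysis.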


SLIDE 15

Application: ARES is a massively parallel, multi-dimensional, multi-physics code at LLNL

Physics Capabilities:

ALE-AMR Hydrodynamics

High-order Eulerian Hydrodynamics

Elastic-Plastic flow

3T plasma physics

High-Explosive modeling

Diffusion, SN Radiation

Particulate flow

Laser ray-tracing

Magnetohydrodynamics (MHD)

Dynamic mixing

Non-LTE opacities

Applications:

Inertial Confinement Fusion (ICF)

Pulsed power

National Ignition Facility debris

High-Explosive experiments


SLIDE 16

ARES

◮ 800k lines of C/C++ with MPI
◮ 22 years old, used daily on our current supercomputers
◮ Single code base effectively utilizes all HPC platforms


SLIDE 17

ARES uses RAJA https://github.com/LLNL/RAJA

◮ 800k lines of C/C++ with MPI
◮ 22 years old, used daily on our current supercomputers
◮ Single code base effectively utilizes all HPC platforms
◮ Use RAJA as an abstraction layer for on-node parallelization
  ◮ RAJA is a collection of C++ software abstractions
  ◮ Separation of concerns

C-style for-loop:

  double* x; double* y;
  double a;
  for (int i = begin; i < end; ++i) {
    y[i] += a * x[i];
  }

RAJA-style loop:

  double* x; double* y;
  double a;
  RAJA::forall<exec_policy>(begin, end, [=] (int i) {
    y[i] += a * x[i];
  });

◮ Use different RAJA backends (CUDA, OpenMP)
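A minimal sketch of selecting a backend at compile time, assuming RAJA's cuda_exec and omp_parallel_for_exec policies and the RangeSegment-based forall overload from recent RAJA releases (the exact policy names and signatures may differ from the RAJA version used in ARES):

#include <RAJA/RAJA.hpp>

// Choose the execution policy per build: CUDA backend for the GPU,
// OpenMP backend for the CPU. The loop body itself does not change.
#if defined(RAJA_ENABLE_CUDA)
using exec_policy = RAJA::cuda_exec<256>;          // 256 threads per CUDA block
#else
using exec_policy = RAJA::omp_parallel_for_exec;   // OpenMP 'parallel for' on the CPU
#endif

void daxpy(int begin, int end, double a, const double* x, double* y) {
  RAJA::forall<exec_policy>(RAJA::RangeSegment(begin, end),
    [=] RAJA_DEVICE (int i) {   // RAJA_DEVICE marks the lambda for device compilation
      y[i] += a * x[i];
    });
}

Swapping exec_policy is the only change needed to move this loop between the CPU and the GPU, which is what lets a single code base target every platform.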


SLIDE 18

Results

3D Sedov blastwave problem

◮ Hydrodynamics calculation
◮ ≈ 80 kernels


SLIDE 19

Results

3D Sedov blastwave problem

◮ Hydrodynamics calculation
◮ ≈ 80 kernels
◮ Pre-Sierra machine (rzmanta), Minsky nodes:
  ◮ 2x POWER8+ CPUs (20 cores)
  ◮ 4x NVIDIA P100 (Pascal) GPUs with 16 GB memory each
  ◮ NVLink 1.0
  ◮ Some results generated with pre-release versions of compilers; improvements in performance expected in future releases
◮ All results shown use 4 Minsky nodes (16 GPUs)


SLIDE 20

Domain decomposition with and without MPS

[Figure: domain decompositions for MPI process/GPU and for 4 MPI processes/GPU + MPS]

Differences:
◮ Computation: work per MPI process
◮ Communication: neighbors and surface-to-volume ratio (see the worked example below)
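An illustrative calculation (numbers chosen for round arithmetic, not results from the slides): with roughly cubic subdomains of n³ zones per rank, the halo surface is about 6n² zones, so the surface-to-volume ratio scales as 6/n and grows as subdomains shrink. For example, 320³ zones split across 16 ranks (80 x 160 x 160 zones each) gives a surface-to-volume ratio of about 0.05, while the same problem split across 64 ranks (80 x 80 x 80 zones each) gives about 0.075, i.e., roughly 1.5x more halo data exchanged per zone owned.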


SLIDE 21

Overall runtime with and without MPS

[Plot: overall runtime (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]


SLIDE 22

Overall runtime with and without MPS

[Plot: overall runtime (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]

Differences:
◮ Computation
◮ Communication
◮ Memory


SLIDE 23

Computation time: Small kernels

[Plot: computation time for small kernels (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]


SLIDE 24

Computation time: Small kernels

[Plot: computation time for small kernels (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]

◮ Few zones, small amount of work per zone
◮ Dominated by kernel launch overhead
◮ MPS may be slightly slower


SLIDE 25

Computation time: Large kernels

[Plot: computation time for large kernels (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]


SLIDE 26

Computation time: Large kernels

[Plot: computation time for large kernels (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]

MPS is faster, especially when the problem size is large.
◮ Utilizing the GPU better?
  ◮ GPU utilization?
  ◮ GPU occupancy?
◮ Utilizing the CPU better?
  ◮ More parallelization?
  ◮ Better utilization of CPU memory bandwidth?


SLIDE 27

Waiting on the GPU: cudaDeviceSynchronize

[Plot: time in cudaDeviceSynchronize (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]


SLIDE 28

Waiting on the GPU: cudaDeviceSynchronize

[Plot: time in cudaDeviceSynchronize (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]

◮ Appear to be waiting on the GPU longer without MPS (a measurement sketch follows)
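A minimal sketch, not the actual ARES/Caliper instrumentation, of how the time spent waiting in cudaDeviceSynchronize can be separated from the (asynchronous) kernel launch itself; my_kernel stands in for any of the roughly 80 ARES kernels:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel() {}       // placeholder kernel

double timed_sync() {
  my_kernel<<<1024, 256>>>();        // launch returns immediately
  auto t0 = std::chrono::steady_clock::now();
  cudaDeviceSynchronize();           // the CPU blocks here until the GPU finishes
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  printf("waited %.6f s on the GPU\n", timed_sync());
  return 0;
}

Accumulating such waits over a run approximates the cudaDeviceSynchronize time plotted above; shorter waits with MPS are consistent with the GPU being kept busier by kernels from the other ranks.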


SLIDE 29

Domain decomposition with and without MPS

[Figure: domain decompositions for MPI process/GPU and for 4 MPI processes/GPU + MPS]



SLIDE 31

Domain decomposition with and without MPS

[Figure: domain decompositions for MPI process/GPU and for 4 MPI processes/GPU + MPS]

Differences in communication:
◮ Number of neighbors in halo exchange
◮ Surface-to-volume ratio
◮ Processor mapping
◮ Other


SLIDE 32

Communication time (MPI)

[Plot: MPI communication time (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS, decomp1]


SLIDE 33

Communication time (MPI)

[Plot: MPI communication time (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS, decomp1]

◮ More MPI processes = more communication


SLIDE 34

Communication time (MPI)

[Plot: MPI communication time (sec) vs. problem size (zones³) for MPI process/GPU and for 4 MPI processes/GPU + MPS with decomp1 and decomp2]

◮ More MPI processes = more communication


SLIDE 35

Communication time (MPI)

[Plot: MPI communication time (sec) vs. problem size (zones³) for MPI process/GPU and for 4 MPI processes/GPU + MPS with decomp1, decomp2, and decomp3]

◮ More MPI processes = more communication
◮ Not all decompositions result in the same communication time (see the sketch below)
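A minimal sketch, not from the slides, of one way to compare candidate decompositions before running: for the ranks sharing one GPU, estimate the halo surface of each factorization of the per-GPU zone block. The block size and factorizations below are illustrative:

#include <cstdio>

// For a block of nx*ny*nz zones split among px*py*pz ranks, the per-rank halo
// surface (the zones on the six subdomain faces) is a proxy for the data each
// rank exchanges per step.
long long halo_surface(int nx, int ny, int nz, int px, int py, int pz) {
  int sx = nx / px, sy = ny / py, sz = nz / pz;   // per-rank subdomain extents
  return 2LL * (sx * sy + sy * sz + sx * sz);
}

int main() {
  // Four ranks per GPU can be laid out in several ways; the halo surface differs.
  const int nx = 320, ny = 320, nz = 320;
  printf("4x1x1: %lld halo zones per rank\n", halo_surface(nx, ny, nz, 4, 1, 1));
  printf("2x2x1: %lld halo zones per rank\n", halo_surface(nx, ny, nz, 2, 2, 1));
  return 0;
}

Different factorizations of the same rank count produce different surface areas (and different neighbor counts and processor mappings), which is one reason decomp1, decomp2, and decomp3 can have different communication times.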


SLIDE 36

Conclusions

◮ MPS can be useful if non-accelerated portions of the code need all CPU cores
◮ MPS can help to utilize the GPU better
◮ However, using more MPI processes makes communication more expensive; many factors may have an impact
◮ Caliper measures many aspects of performance, but there are more questions:
  ◮ Does using more CPU cores increase CPU memory bandwidth utilization?
  ◮ How well am I utilizing/occupying the GPU?
  ◮ What is the bottleneck now: the CPU or the GPU?
  ◮ Other issues on new platforms


SLIDE 37

Thank you

◮ ARES team (Lawrence Livermore National Laboratory)
◮ Caliper team (https://github.com/LLNL/Caliper)
◮ RAJA team (https://github.com/LLNL/RAJA)
◮ Steve Rennich, Max Katz (NVIDIA)
◮ Leopold Grinberg (IBM)
