Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?

SLIDE 1

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?

GTC 2018 March 28, 2018

Olga Pearce (Lawrence Livermore National Laboratory) http://people.llnl.gov/olga Max Katz (NVIDIA), Leopold Grinberg (IBM)

LLNL-PRES-746880 Slide 1

SLIDE 2

Multi-Process Service (MPS)

Allows kernels launched from different (MPI) processes to be processed concurrently on the same GPU

◮ Utilize inactive SMs when the work is small

[Figure: GPU schedule over time and SMs; the GPU is shared 'in space']


SLIDE 3

Multi-Process Service (MPS)

Allows kernels launched from different (MPI) processes to be processed concurrently on the same GPU

◮ Utilize inactive SMs when the work is small
◮ Processes take turns if every SM is occupied

[Figure: GPU schedules over time and SMs; the GPU is shared 'in space' and 'in time']
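MPS is enabled by starting the MPS control daemon (nvidia-cuda-mps-control -d) on each node before the MPI job launches. A minimal sketch of the situation MPS addresses, assuming a hypothetical saxpy kernel and several MPI ranks sharing one GPU (not code from the slides):

// Each MPI rank on the node submits a small kernel to the same GPU.
// Without MPS, work from different processes is time-sliced on the device;
// with the MPS daemon running, the kernels can run concurrently and fill
// otherwise idle SMs. 'saxpy' is an illustrative kernel, not an ARES kernel.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void saxpy(int n, double a, const double* x, double* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] += a * x[i];
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  cudaSetDevice(0);                      // every rank on this node uses GPU 0
  const int n = 1 << 18;                 // small: far fewer threads than the GPU can hold
  double *x, *y;
  cudaMalloc(&x, n * sizeof(double));
  cudaMalloc(&y, n * sizeof(double));
  cudaMemset(x, 0, n * sizeof(double));
  cudaMemset(y, 0, n * sizeof(double));

  saxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
  cudaDeviceSynchronize();               // each rank waits only for its own kernel

  cudaFree(x); cudaFree(y);
  MPI_Finalize();
  return 0;
}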


SLIDE 4

Sierra system architecture finalized and currently under deployment at LLNL

Compute System:
◮ 4,320 nodes
◮ 1.29 PB memory
◮ 240 compute racks
◮ 125 PFLOPS
◮ ≈ 12 MW


SLIDE 5

Sierra system architecture finalized and currently under deployment at LLNL

Compute System:
◮ 4,320 nodes
◮ 1.29 PB memory
◮ 240 compute racks
◮ 125 PFLOPS
◮ ≈ 12 MW

Compute Node:
◮ 2 IBM POWER9 CPUs
◮ 4 NVIDIA Volta GPUs
◮ 256 GiB DDR4
◮ NVMe-compatible PCIe 1.6 TB SSD
◮ 16 GiB globally addressable HBM2 associated with each GPU
◮ Coherent shared memory


SLIDE 6

Sierra system architecture finalized and currently under deployment at LLNL

Compute System:
◮ 4,320 nodes
◮ 1.29 PB memory
◮ 240 compute racks
◮ 125 PFLOPS
◮ ≈ 12 MW

Compute Node:
◮ 2 IBM POWER9 CPUs
◮ 4 NVIDIA Volta GPUs
◮ 256 GiB DDR4
◮ NVMe-compatible PCIe 1.6 TB SSD
◮ 16 GiB globally addressable HBM2 associated with each GPU
◮ Coherent shared memory
◮ CPUs: ≈ 5% of FLOPS; GPUs: ≈ 95% of FLOPS


SLIDE 7

Ways to utilize a node of Sierra (showing one socket)

[Figure: one socket of a node: one CPU and two GPUs]


SLIDE 8

Ways to utilize a node of Sierra (showing one socket)

◮ MPI process/core

[Figure: rank placement on one socket (CPU and two GPUs)]


SLIDE 9

Ways to utilize a node of Sierra (showing one socket)

◮ MPI process/core
◮ MPI process/GPU

[Figure: rank placement for each configuration on one socket (CPU and two GPUs)]


SLIDE 10

Ways to utilize a node of Sierra (showing one socket)

◮ MPI process/core
◮ MPI process/GPU
◮ MPI process/core, MPS for GPU (see the sketch below)

[Figure: rank placement for each configuration on one socket (CPU and two GPUs)]
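A minimal sketch, not from the slides, of how ranks can be bound to GPUs in the "MPI process/core, MPS for GPU" configuration; the node-local communicator and the ranks-per-GPU arithmetic are illustrative assumptions:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Ranks that share this node, so we can compute a node-local rank.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank, local_size;
  MPI_Comm_rank(node_comm, &local_rank);
  MPI_Comm_size(node_comm, &local_size);

  int num_gpus = 0;
  cudaGetDeviceCount(&num_gpus);                     // e.g., 4 GPUs per node

  // With one rank per core and MPS, several consecutive ranks share a GPU.
  int ranks_per_gpu = (local_size + num_gpus - 1) / num_gpus;
  cudaSetDevice(local_rank / ranks_per_gpu);

  // ... application work: the ranks mapped to one GPU submit kernels through MPS ...

  MPI_Finalize();
  return 0;
}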



SLIDE 12

Parallel performance of multiphysics simulations

Decide how to run each phase of the multiphysics simulation

◮ On GPU vs. on CPU
◮ How many MPI processes (one per CPU core or one per GPU)?
◮ If some phases use one MPI process per CPU core, can we use MPS for the accelerated phases?


SLIDE 13

Parallel performance of multiphysics simulations

Decide how to run each phase of the multiphysics simulation

◮ On GPU vs. on CPU
◮ How many MPI processes (one per CPU core or one per GPU)?
◮ If some phases use one MPI process per CPU core, can we use MPS for the accelerated phases?

Outline:
◮ Tools used for measurement
◮ Multiphysics application
◮ How the application is accelerated
◮ Results: MPI process/GPU vs. 4 MPI processes/GPU + MPS
  ◮ Impact on kernel performance
  ◮ Impact on communication


SLIDE 14

Tool: Caliper [SC’16] https://github.com/LLNL/Caliper

◮ Performance analysis toolbox, leverages existing tools
◮ Developed at LLNL
◮ Caliper team is responsive to our needs

1. Annotate: begin/end API similar to timer libraries (see the sketch after this list)
   ◮ Annotations of libraries (e.g., SAMRAI, hypre) combine seamlessly
2. Collect: runtime parameters instruct Caliper what to measure:
   ◮ MPI function calls
   ◮ Linux perf_event sampling (libpfm)
   ◮ CUDA driver/runtime calls (using CUPTI)
3. Analyze
   ◮ Using JSON output format
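A minimal sketch of the begin/end annotation style, assuming Caliper's CALI_MARK_BEGIN/CALI_MARK_END and CALI_CXX_MARK_FUNCTION macros from <caliper/cali.h>; the phase names are illustrative, not the actual ARES annotations:

#include <caliper/cali.h>          // Caliper annotation macros

void time_step() {
  CALI_CXX_MARK_FUNCTION;          // mark this whole function as a Caliper region

  CALI_MARK_BEGIN("hydro");        // begin/end API, similar to a timer library
  // ... hydrodynamics kernels ...
  CALI_MARK_END("hydro");

  CALI_MARK_BEGIN("halo_exchange");
  // ... MPI communication ...
  CALI_MARK_END("halo_exchange");
}

What gets recorded for these regions (MPI calls, perf_event samples, CUPTI events) is selected through Caliper's runtime configuration, and the results can be written as JSON for analysis.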


SLIDE 15

Application: ARES is a massively parallel, multi-dimensional, multi-physics code at LLNL

Physics Capabilities:

ALE-AMR Hydrodynamics

High-order Eulerian Hydrodynamics

Elastic-Plastic flow

3T plasma physics

High-Explosive modeling

Diffusion, SN Radiation

Particulate flow

Laser ray-tracing

Magnetohydrodynamics (MHD)

Dynamic mixing

Non-LTE opacities

Applications:

Inertial Confinement Fusion (ICF)

Pulsed power

National Ignition Facility debris

High-Explosive experiments


SLIDE 16

ARES

◮ 800k lines of C/C++ with MPI
◮ 22 years old, used daily on our current supercomputers
◮ Single code base effectively utilizes all HPC platforms


SLIDE 17

ARES uses RAJA https://github.com/LLNL/RAJA

◮ 800k lines of C/C++ with MPI
◮ 22 years old, used daily on our current supercomputers
◮ Single code base effectively utilizes all HPC platforms
◮ Use RAJA as an abstraction layer for on-node parallelization
  ◮ RAJA is a collection of C++ software abstractions
  ◮ Separation of concerns

C-style for-loop:

  double* x; double* y;
  double a;
  for (int i = begin; i < end; ++i) {
    y[i] += a * x[i];
  }

RAJA-style loop:

  double* x; double* y;
  double a;
  RAJA::forall<exec_policy>(begin, end, [=] (int i) {
    y[i] += a * x[i];
  });

◮ Use different RAJA backends (CUDA, OpenMP)
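A minimal sketch of selecting a backend at compile time, assuming RAJA's cuda_exec and omp_parallel_for_exec policies and the RangeSegment-based forall overload from recent RAJA releases (the exact policy names and signatures may differ from the RAJA version used in ARES):

#include <RAJA/RAJA.hpp>

// Choose the execution policy per build: CUDA backend for the GPU,
// OpenMP backend for the CPU. The loop body itself does not change.
#if defined(RAJA_ENABLE_CUDA)
using exec_policy = RAJA::cuda_exec<256>;          // 256 threads per CUDA block
#else
using exec_policy = RAJA::omp_parallel_for_exec;   // OpenMP 'parallel for' on the CPU
#endif

void daxpy(int begin, int end, double a, const double* x, double* y) {
  RAJA::forall<exec_policy>(RAJA::RangeSegment(begin, end),
    [=] RAJA_DEVICE (int i) {   // RAJA_DEVICE marks the lambda for device compilation
      y[i] += a * x[i];
    });
}

Swapping exec_policy is the only change needed to move this loop between the CPU and the GPU, which is what lets a single code base target every platform.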


SLIDE 18

Results

3D Sedov blastwave problem

◮ Hydrodynamics calculation
◮ ≈ 80 kernels


SLIDE 19

Results

3D Sedov blastwave problem

◮ Hydrodynamics calculation
◮ ≈ 80 kernels
◮ Pre-Sierra machine (rzmanta), Minsky nodes:
  ◮ 2x POWER8+ CPUs (20 cores)
  ◮ 4x NVIDIA P100 (Pascal) GPUs with 16 GB memory each
  ◮ NVLink 1.0
  ◮ Some results generated with pre-release versions of compilers; improvements in performance expected in future releases
◮ All results shown use 4 Minsky nodes (16 GPUs)


SLIDE 20

Domain decomposition with and without MPS

[Figure: domain decompositions for MPI process/GPU and for 4 MPI processes/GPU + MPS]

Differences:
◮ Computation: work per MPI process
◮ Communication: neighbors and surface-to-volume ratio (see the worked example below)
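An illustrative calculation (numbers chosen for round arithmetic, not results from the slides): with roughly cubic subdomains of n³ zones per rank, the halo surface is about 6n² zones, so the surface-to-volume ratio scales as 6/n and grows as subdomains shrink. For example, 320³ zones split across 16 ranks (80 x 160 x 160 zones each) gives a surface-to-volume ratio of about 0.05, while the same problem split across 64 ranks (80 x 80 x 80 zones each) gives about 0.075, i.e., roughly 1.5x more halo data exchanged per zone owned.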


SLIDE 21

Overall runtime with and without MPS

[Plot: overall runtime (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]


SLIDE 22

Overall runtime with and without MPS

[Plot: overall runtime (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]

Differences:
◮ Computation
◮ Communication
◮ Memory


SLIDE 23

Computation time: Small kernels

[Plot: computation time for small kernels (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]


SLIDE 24

Computation time: Small kernels

[Plot: computation time for small kernels (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]

◮ Few zones, small amount of work per zone
◮ Dominated by kernel launch overhead
◮ MPS may be slightly slower


SLIDE 25

Computation time: Large kernels

[Plot: computation time for large kernels (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]


SLIDE 26

Computation time: Large kernels

[Plot: computation time for large kernels (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]

MPS is faster, especially when the problem size is large.
◮ Utilizing the GPU better?
  ◮ GPU utilization?
  ◮ GPU occupancy?
◮ Utilizing the CPU better?
  ◮ More parallelization?
  ◮ Better utilization of CPU memory bandwidth?


SLIDE 27

Waiting on the GPU: cudaDeviceSynchronize

[Plot: time in cudaDeviceSynchronize (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]


SLIDE 28

Waiting on the GPU: cudaDeviceSynchronize

[Plot: time in cudaDeviceSynchronize (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS]

◮ Appear to be waiting on the GPU longer without MPS (a measurement sketch follows)
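A minimal sketch, not the actual ARES/Caliper instrumentation, of how the time spent waiting in cudaDeviceSynchronize can be separated from the (asynchronous) kernel launch itself; my_kernel stands in for any of the roughly 80 ARES kernels:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel() {}       // placeholder kernel

double timed_sync() {
  my_kernel<<<1024, 256>>>();        // launch returns immediately
  auto t0 = std::chrono::steady_clock::now();
  cudaDeviceSynchronize();           // the CPU blocks here until the GPU finishes
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  printf("waited %.6f s on the GPU\n", timed_sync());
  return 0;
}

Accumulating such waits over a run approximates the cudaDeviceSynchronize time plotted above; shorter waits with MPS are consistent with the GPU being kept busier by kernels from the other ranks.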


SLIDE 29

Domain decomposition with and without MPS

[Figure: domain decompositions for MPI process/GPU and for 4 MPI processes/GPU + MPS]



SLIDE 31

Domain decomposition with and without MPS

[Figure: domain decompositions for MPI process/GPU and for 4 MPI processes/GPU + MPS]

Differences in communication:
◮ Number of neighbors in halo exchange
◮ Surface-to-volume ratio
◮ Processor mapping
◮ Other


SLIDE 32

Communication time (MPI)

[Plot: MPI communication time (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS, decomp1]


SLIDE 33

Communication time (MPI)

[Plot: MPI communication time (sec) vs. problem size (zones³) for MPI process/GPU and 4 MPI processes/GPU + MPS, decomp1]

◮ More MPI processes = more communication


SLIDE 34

Communication time (MPI)

[Plot: MPI communication time (sec) vs. problem size (zones³) for MPI process/GPU and for 4 MPI processes/GPU + MPS with decomp1 and decomp2]

◮ More MPI processes = more communication


SLIDE 35

Communication time (MPI)

[Plot: MPI communication time (sec) vs. problem size (zones³) for MPI process/GPU and for 4 MPI processes/GPU + MPS with decomp1, decomp2, and decomp3]

◮ More MPI processes = more communication
◮ Not all decompositions result in the same communication time (see the sketch below)
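A minimal sketch, not from the slides, of one way to compare candidate decompositions before running: for the ranks sharing one GPU, estimate the halo surface of each factorization of the per-GPU zone block. The block size and factorizations below are illustrative:

#include <cstdio>

// For a block of nx*ny*nz zones split among px*py*pz ranks, the per-rank halo
// surface (the zones on the six subdomain faces) is a proxy for the data each
// rank exchanges per step.
long long halo_surface(int nx, int ny, int nz, int px, int py, int pz) {
  int sx = nx / px, sy = ny / py, sz = nz / pz;   // per-rank subdomain extents
  return 2LL * (sx * sy + sy * sz + sx * sz);
}

int main() {
  // Four ranks per GPU can be laid out in several ways; the halo surface differs.
  const int nx = 320, ny = 320, nz = 320;
  printf("4x1x1: %lld halo zones per rank\n", halo_surface(nx, ny, nz, 4, 1, 1));
  printf("2x2x1: %lld halo zones per rank\n", halo_surface(nx, ny, nz, 2, 2, 1));
  return 0;
}

Different factorizations of the same rank count produce different surface areas (and different neighbor counts and processor mappings), which is one reason decomp1, decomp2, and decomp3 can have different communication times.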


SLIDE 36

Conclusions

◮ MPS can be useful if non-accelerated portions of the code need all CPU cores
◮ MPS can help to utilize the GPU better
◮ However, using more MPI processes makes communication more expensive; many factors may have an impact
◮ Caliper measures many aspects of performance, but there are more questions:
  ◮ Does using more CPU cores increase CPU memory bandwidth utilization?
  ◮ How well am I utilizing/occupying the GPU?
  ◮ What is the bottleneck now: the CPU or the GPU?
  ◮ Other issues on new platforms


SLIDE 37

Thank you

◮ ARES team (Lawrence Livermore National Laboratory)
◮ Caliper team (https://github.com/LLNL/Caliper)
◮ RAJA team (https://github.com/LLNL/RAJA)
◮ Steve Rennich, Max Katz (NVIDIA)
◮ Leopold Grinberg (IBM)
