1
Performance Analysis, Modeling and Optimization of CUDA Codes with the RambutanAcc DAG-based Framework
Tan Nguyen, John Bachan, Samuel Williams, David Donofrio, John Shalf, Cy Chan
Lawrence Berkeley National Laboratory
GPU Technology Conference - May 8, 2017
2
PERFORMANCE ANALYSIS, MODELING AND OPTIMIZATION OF CUDA CODES WITH A DAG-BASED FRAMEWORK
This talk presents techniques for scaling CUDA applications to the many cores of a single GPU, as well as to many GPUs of a cluster. By representing the application as a DAG (directed acyclic graph), as introduced by our RambutanAcc framework, the programmer can improve application performance and scalability with a fine-grained task scheduler and a lightweight communication handler implemented in a runtime system. The programmer can also gain insight into application behavior by using the analysis and modeling tools embedded in the framework.
Lawrence Berkeley National Laboratory, Computational Research Division
3
4
5
– Tesla C1060 (240 cores), Fermi C2050 (448 cores)
– Kepler K20 (2496 cores), K40 (2880 cores), K80 (2x 2496 cores)
– Pascal P100 (3584 cores)
6
– Maximizing performance at low programmer effort
– GPU provides higher performance but less (on-device) memory
– Host CPU has lower performance but more memory
– Manage data across multiple nodes
– Need runtime support for data management
[Figure: a node with CPU and DRAM connected to GPU DRAM over PCIe]
7
[Figure: multiple nodes, each with CPU DRAM and several GPU DRAMs, connected by an interconnect; Direct Memory Access (DMA) within a node and RDMA across nodes]
– Asynchronous communication
– Direct communication among GPUs
8
– Many runtime systems do not provide an effective mechanism to co-schedule workloads on a GPU
– Instead, the programmer must launch kernels on the GPU and hope that Hyper-Q will schedule them intelligently (very challenging in many cases)
– Not many runtimes support direct communication among GPUs
– Likewise for load balancing on a GPU and among GPUs
– Most runtimes are developed primarily for production runs
– Performance analysis and modeling are important in hardware & software design exploration
9
10
– Extend Rambutan, an asynchronous programming model
– Support GPU execution within the runtime system
– Analyze the performance behavior
11
[Figures: Cholesky Factorization DAG (factorize F, solve S, update U tasks), 3D Stencil DAG, 2.5D Cannon Matrix Multiply DAG (Shift A, Shift B, Accumulate C), CNS/SMC DAG]
12
– A task space encapsulates the behavior of a class of tasks
– Tasks are created dynamically at runtime
– E.g., task <0, 1> (iteration 1) will not be created until task <0, 0> (iteration 0) completes; see the sketch after the figure below
– A data space encapsulates access to and management of a class of data
– Data parcels are the granularity of data handled by the runtime
– A task may require data inputs, each a partition of the data space called a parcel
– Tasks may produce output parcels on execution
– A parcel is associated with a locale, indicating where the data resides (CPU or GPU DRAM)
– The runtime system is responsible for migrating parcels
[Figure: 3D Stencil DAG]
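To make the dynamic-creation rule concrete, here is a minimal host-side sketch in which completing task <0, iter> creates task <0, iter+1>. The TaskKey, create_task, and run_task names are hypothetical; the slides do not show RambutanAcc's host API.

#include <cstdio>
#include <queue>

struct TaskKey { int block, iter; };            // index into the task space
static const int MAX_ITER = 3;
static std::queue<TaskKey> ready;               // stand-in for the runtime's ready queue

static void create_task(TaskKey k) { ready.push(k); }

static void run_task(TaskKey k) {
    std::printf("running task <%d, %d>\n", k.block, k.iter);
    // Post-completion action: dynamically create the next task in the space.
    if (k.iter + 1 < MAX_ITER) create_task({k.block, k.iter + 1});
}

int main() {
    create_task({0, 0});                        // seed the task space
    while (!ready.empty()) { TaskKey k = ready.front(); ready.pop(); run_task(k); }
    return 0;
}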
13
Type 1: Tasks running on host
Type 2: Tasks running on host, offloading compute-intensive kernels to GPUs
Type 3: Tasks running on GPUs
Tasks may specify post-completion actions (e.g. create a new task)
14
Type 2 tasks employ the traditional CUDA non-blocking kernel launch model
Type 1: Tasks running on host
Type 2: Tasks running on host and GPUs
– Port legacy CUDA codes quickly
Type 3: Tasks running on GPUs
void launch(cudaStream_t stream) {
  // launch the kernel asynchronously on the given stream
  kernel1<<<dimGrid, dimBlock, 0, stream>>>(arguments);
}

void finish() {
  // post-completion action: create a new task which launches kernel2
}
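Expanded into a self-contained sketch, a Type 2 task boils down to standard CUDA stream usage. kernel1 and the polling loop below are illustrative assumptions; the real runtime would interleave other tasks while waiting.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel1(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;                    // placeholder compute
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    kernel1<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);   // non-blocking launch

    // The host worker stays free while the kernel runs; the runtime polls for
    // completion and then runs the post-completion action, e.g. creating the
    // task that launches kernel2.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) { /* run other tasks */ }

    std::printf("kernel1 done\n");
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}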
15
Type 1: Tasks running on host
Type 2: Tasks running on host and GPUs
Type 3: Tasks running on GPUs
– Low launching overhead
– Run tasks on individual SMs (sketched after the figure below)
__device__ void myKernel(void* sArgs, void* appArgs) {
  // sArgs contains information of threadIdx, blockIdx, dimBlock and dimGrid
  // compute
}

void finish() {
  // post-completion action: create a new task
}
[Figure: Cholesky factorization DAG; Type 3 tasks such as U(3,2,3) and U(2,0,2) each run on their own logical grid]
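The slides only say that sArgs carries logical threadIdx, blockIdx, dimBlock and dimGrid. The SchedulerArgs layout below is an assumption, sketching how a Type 3 task might recover a global index when it runs inside a persistent kernel rather than in its own launch.

#include <cuda_runtime.h>

// Hypothetical layout of the scheduler-provided arguments of a Type 3 task.
// The runtime must virtualize thread/block indices because the task runs on
// a subset of SMs, not in a dedicated kernel launch.
struct SchedulerArgs {
    uint3 logicalThreadIdx, logicalBlockIdx;   // indices within the task's grid
    dim3  dimBlock, dimGrid;                   // logical block/grid dimensions
};

__device__ void myTask(void* sArgs, void* appArgs) {
    SchedulerArgs* s = static_cast<SchedulerArgs*>(sArgs);
    float* data = static_cast<float*>(appArgs);
    // Compute a logical global index from the virtualized coordinates.
    int i = s->logicalBlockIdx.x * s->dimBlock.x + s->logicalThreadIdx.x;
    data[i] += 1.0f;                           // placeholder compute
}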
16
– Tasks are created at runtime
– Existing tasks issue requests to create new tasks
– Depending on its type, a task is scheduled on the host or a GPU (see the sketch after the figure below)
– Tasks are buffered to reduce scheduling overhead
– The communication handler deals with all types of communication (host-host, host-GPU, GPU-GPU)
– Communication proceeds in an asynchronous fashion
[Figure: RambutanAcc runtime system. A task scheduler feeds host and accelerator ready queues; a fetching queue and per-worker task buffers supply host workers and accelerator workers; workers update their status and commit tasks; the communication handler creates new tasks from incoming requests.]
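A minimal sketch of the routing just described, assuming simple std::queue ready queues; the actual RambutanAcc scheduler, with task buffering and worker commit, is more involved.

#include <queue>

enum class TaskType { Host, HostOffload, Gpu };  // Types 1, 2 and 3
struct Task { TaskType type; /* kernel, parcels, ... */ };

static std::queue<Task> host_ready, acc_ready;

// Depending on its type, a task goes to the host or accelerator ready queue;
// workers then drain per-worker buffers filled from these queues in batches,
// amortizing scheduling overhead over several tasks.
static void schedule(const Task& t) {
    if (t.type == TaskType::Gpu) acc_ready.push(t);
    else                         host_ready.push(t);
}

int main() {
    schedule({TaskType::Host});
    schedule({TaskType::Gpu});
    return 0;
}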
17
[Figure: a CUDA thread grid partitioned into workers of multiple thread blocks (TB): Worker 0 on SMs 0-1, Worker 1 on SMs 2-3, Worker 2 on SMs 4-5, Worker 3 on SMs 6-7]
Persistent kernel servicing tasks on GPU
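A persistent kernel in this style can be sketched as follows: a fixed number of resident blocks loop, claiming task ids from a global counter until the work runs out. The task body and the worker-to-SM assignment are placeholders; RambutanAcc's actual worker layout (figure above) is more elaborate.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void persistentWorker(int* taskCounter, int numTasks, float* data) {
    __shared__ int task;
    for (;;) {
        if (threadIdx.x == 0)
            task = atomicAdd(taskCounter, 1);      // block leader claims a task
        __syncthreads();
        if (task >= numTasks) return;              // no work left: exit kernel
        int i = task * blockDim.x + threadIdx.x;   // task-local indexing
        data[i] += 1.0f;                           // placeholder task body
        __syncthreads();                           // done before leader reclaims
    }
}

int main() {
    const int numTasks = 64, blockSize = 128;
    float* d;
    cudaMalloc(&d, numTasks * blockSize * sizeof(float));
    cudaMemset(d, 0, numTasks * blockSize * sizeof(float));
    int* ctr;
    cudaMalloc(&ctr, sizeof(int));
    cudaMemset(ctr, 0, sizeof(int));
    persistentWorker<<<8, blockSize>>>(ctr, numTasks, d);  // 8 resident blocks
    cudaDeviceSynchronize();
    std::printf("all tasks serviced by the persistent kernel\n");
    cudaFree(d);
    cudaFree(ctr);
    return 0;
}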
18
[Figure: host-staged communication between GPUs on different nodes: a requesting remote procedure call, cudaStreamQuery polling, cudaMemcpyAsync from GPU to host, gasnet_put_nb across the interconnect, a responding remote procedure call, and cudaMemcpyAsync from host to the destination GPU]
[Figure: direct (DMA) path: (1) remote procedure call to request a parcel, (2) gasnet_put_nb sends the data location, (3) DMA transfer between GPUs, (4) notify]
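The host-staged path can be sketched as below, assuming GASNet is initialized elsewhere, h_stage is pinned host memory, and peer_host_buf lies in the destination node's registered segment; all names are illustrative and error handling is omitted.

#include <cuda_runtime.h>
#include <gasnet.h>

void send_parcel(float* d_src, float* h_stage,          // device source, pinned staging
                 gasnet_node_t peer, void* peer_host_buf,
                 size_t bytes, cudaStream_t stream) {
    // (1) Stage the parcel from GPU DRAM into pinned host memory.
    cudaMemcpyAsync(h_stage, d_src, bytes, cudaMemcpyDeviceToHost, stream);

    // (2) Poll so the host worker can keep servicing other tasks meanwhile.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) { /* run other tasks */ }

    // (3) One-sided, non-blocking put across the interconnect.
    gasnet_handle_t h = gasnet_put_nb(peer, peer_host_buf, h_stage, bytes);
    gasnet_wait_syncnb(h);                              // or test later, asynchronously

    // On the responding side, a cudaMemcpyAsync host-to-device completes the
    // transfer into the destination GPU's DRAM.
}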
19
20
Sparse Cholesky Matrix Factorization: an irregular algorithm, requiring small tasks to balance work across processors/GPUs. However, it is challenging to scale fine-grained tasks on high-end GPUs.
3D Jacobi Iterative Solver: tasks have the same size due to the structured grid, and task performance is bounded by memory bandwidth. It is also challenging to scale this application to many GPUs due to high communication costs.
Dense Matrix Multiply: tasks are compute intensive, but GPUs process them quickly, so overall performance is also sensitive to communication costs.
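For intuition on why the Jacobi tasks are bandwidth-bound, consider a 7-point sweep: each updated point reads 7 values and writes 1, up to 32 bytes of memory traffic for about 7 flops (less with cache reuse). A plain, untiled kernel sketch (names and launch shape are illustrative, not RambutanAcc's):

__global__ void jacobi7(const float* in, float* out, int nx, int ny, int nz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i <= 0 || j <= 0 || k <= 0 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
        return;                                   // skip boundary points
    long idx = (long)k * nx * ny + (long)j * nx + i;
    // One write per 7 reads: arithmetic intensity stays far below the
    // flop/byte ratio of the GPU, so DRAM bandwidth bounds performance.
    out[idx] = (in[idx - 1] + in[idx + 1] +
                in[idx - nx] + in[idx + nx] +
                in[idx - (long)nx * ny] + in[idx + (long)nx * ny] +
                in[idx]) * (1.0f / 7.0f);
}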
21
[Figure: task time in milliseconds vs. matrix size (128, 256, 512), comparing Type 2 (CUDA launch) with Type 3 (persistent kernel)]
22
[Figure: 3D stencil performance in GFLOP/s vs. number of GPUs (1, 2, 4, 6): Bulk Synchronous, Asynchronous, Asynchronous+DMA, and the Free Comm (theoretical) bound]
23
[Figure: matrix multiply performance in GFLOP/s vs. matrix size (1024, 2048, 4096): Synchronous, Asynchronous-2D, Asynchronous-2D+DMA, Asynchronous-2.5D+DMA, and the Free Comm (theoretical) bound]
24
25
– Dynamic and semi-static load balancers
– Topology-aware mapping algorithms
– Estimate the benefit of asynchronous execution
– Estimate the impact of hardware parameters on application performance (e.g. network topology)
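One simple way to produce such estimates is an analytic model: bound each task by the larger of its compute and memory time, then add an alpha-beta term for communication. The sketch below uses placeholder parameter values, not measured numbers from the talk.

#include <algorithm>
#include <cstdio>

// Estimated task time = max(compute, memory) + communication (alpha-beta).
double task_time(double flops, double bytes, double msgs, double msg_bytes,
                 double peak_flops, double mem_bw, double alpha, double beta) {
    double compute = flops / peak_flops;               // seconds of pure compute
    double memory  = bytes / mem_bw;                    // seconds of DRAM traffic
    double comm    = msgs * alpha + msg_bytes * beta;   // latency + bandwidth terms
    return std::max(compute, memory) + comm;            // compute/memory overlap
}

int main() {
    // Hypothetical stencil task on a hypothetical GPU and network.
    double t = task_time(/*flops*/ 1.2e8, /*bytes*/ 1.3e8, /*msgs*/ 6,
                         /*msg_bytes*/ 1.5e6, /*peak*/ 4.3e12, /*bw*/ 2.4e11,
                         /*alpha*/ 2e-6, /*beta*/ 1.0 / 6e9);
    std::printf("estimated task time: %.3e s\n", t);
    return 0;
}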
26
[Figure: AMR Modeling Workflow. Input code flows through the ExaSAT framework (ROSE frontend, AST, code description <XML>); together with user parameters and an exascale machine config <XML>, it yields a performance spreadsheet and a dependency graph. The BoxLib AMR library, domain problem parameters, and a box hierarchy (a list of box extents, e.g. 0: ((0,0,0) (15,31,15)) 16 32 16, ...) feed ExaSAT and ProgrAMR task graph generation, which emits a task dependency graph (XML). The Mota topology-aware mapping library and the SST/macro network simulator, with Aries dragonfly interconnect topology models and the ExaSAT on-node performance model, produce a topology-aware placement and estimated execution performance for future architectures (co-design).]
27
[Figure: 3DT-1536 hop metrics for the mapping algorithms rdm, rr, ks, sfcs, pfcm, gr, rcm, and rb, topology-aware vs. non-topology-aware; Kiviat machine model, 12k nodes, 10 TB/s DRAM, Dragonfly network, 1.5k ranks (normalized)]
– We model task placement for AMR simulation codes and its interactions with asynchronous execution models:
– ExaSAT (on-node performance modeling), ProgrAMR (asynchronous task graph generation), and Mota Mapper (topology mapping algorithms)
– Topology-aware mapping improves on non-topology-aware algorithms for balanced scenarios, and by up to 55% for more extreme scenarios and hop-latency bound interconnects
– Mapping can target distinct memory regions (e.g. NUMA domains, GPU memory, HBM, etc.)
[Figure: example topology mappings: Z-Morton 3D SFC and Reverse Cuthill-McKee]
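For reference, the Z-Morton ordering named above interleaves the coordinate bits of each box so that nearby boxes receive nearby linear ids. A standard encoder sketch follows; Mota's actual implementation may differ.

#include <cstdint>
#include <cstdio>

// Insert two zero bits after each of the low 21 bits of v.
static uint64_t spread3(uint64_t v) {
    v &= 0x1fffff;
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v <<  8) & 0x100f00f00f00f00fULL;
    v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}

// Interleave x, y, z bits into a single Z-Morton key.
uint64_t morton3(uint32_t x, uint32_t y, uint32_t z) {
    return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}

int main() {
    std::printf("morton3(1,2,3) = %llu\n",
                (unsigned long long)morton3(1, 2, 3));
    return 0;
}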
28
– The impact of irregularity on application performance
– The benefit of asynchronous execution, even for highly regular algorithms
– The limits of current algorithms and the potential benefit of alternative execution models
– Communication sensitivity, and hardware specialization
Task-based DAG Execution Model
29
– A persistent CUDA kernel can help an application scale to the many cores of a GPU without complicating application kernel code
– Once written in DAG form, an application can tolerate communication by overlapping it with computation
– Direct memory access and communication-avoiding techniques help improve performance and scalability
30