Performance Analysis, Modeling and Optimization of CUDA codes with the RambutanAcc DAG-based Framework
Tan Nguyen, John Bachan, Samuel Williams, David Donofrio, John Shalf, Cy Chan
Lawrence Berkeley National Laboratory
GPU Technology Conference, May 8, 2017


Slide 1

Tan Nguyen, John Bachan, Samuel Williams, David Donofrio, John Shalf, Cy Chan

Lawrence Berkeley National Laboratory, GPU Technology Conference, May 8, 2017

Performance Analysis, Modeling and Optimization of CUDA codes with the RambutanAcc DAG-based Framework

Slide 2

Session Info

  • S7335 - PERFORMANCE ANALYSIS, MODELING, AND OPTIMIZATION OF CUDA CODES WITH A DAG-BASED FRAMEWORK
  • We'll discuss how programmers can scale CUDA codes to the SMs of a GPU as well as to many GPUs of a cluster. By representing the application as a DAG (directed acyclic graph), as introduced by our RambutanAcc framework, the programmer can improve application performance and scalability with a fine-grained task scheduler and lightweight communication handler implemented in a runtime system. The programmer can also gain insight into application behavior by using the analysis and modeling tools embedded in the framework.
  • 25-minute Talk
  • Cy Chan, Computer Research Scientist, Lawrence Berkeley National Laboratory, Computational Research Division

Slide 3

Outline

  • GPU Programming Challenges
  • RambutanAcc Programming and Execution Model
  • Experimental Results
  • Related Project Modeling Areas
  • Conclusion
Slide 4

Outline

  • GPU Programming Challenges
  • RambutanAcc Programming and Execution Model
  • Experimental Results
  • Related Project Modeling Areas
  • Conclusion
Slide 5

Programming Challenges

  • Scaling to all the cores of a GPU is a non-trivial task
    – Tesla C1060 (240 cores), Fermi C2050 (448 cores)
    – Kepler K20 (2496 cores), K40 (2880 cores), K80 (2x 2496 cores)
    – Pascal P100 (3584 cores)

Slide 6

Programming Challenges (cont’d)

  • Placing computation on both CPU and GPU
    – Maximizing performance at low programmer effort
  • Data management
    – GPU provides higher performance but less memory (on-device)
    – Host CPU has lower performance but more memory
    – Manage data across multiple nodes
    – Need runtime support for data management

[Figure: a node with CPU and CPU DRAM connected to GPU DRAM over PCIe]

Slide 7

Programming Challenges (cont’d)

[Figure: two nodes, each with CPU DRAM and several GPU DRAMs, joined by an interconnect; Direct Memory Access (DMA) transfers among GPUs and RDMA across the network]

  • Explicit communication complicates the application code
  • Optimizing communication code is challenging
    – Asynchronous communication
    – Direct communication among GPUs (a plain-CUDA sketch follows)
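As a concrete example of the burden on the programmer, here is a minimal plain-CUDA sketch of a direct GPU-to-GPU copy; the buffer and stream names are illustrative, not from the talk:

#include <cuda_runtime.h>

// Copy a buffer directly from device 0 to device 1 (PCIe or NVLink DMA),
// bypassing host DRAM; assumes the hardware supports peer access.
void copyPeer(void* dstOnDev1, const void* srcOnDev0,
              size_t bytes, cudaStream_t stream)
{
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // enable once per device pair
    cudaMemcpyPeerAsync(dstOnDev1, 1, srcOnDev0, 0, bytes, stream);
    cudaStreamSynchronize(stream);      // wait before reusing either buffer
}

Ordering, completion, and overlap must all be managed by hand here, which is exactly the complexity a runtime communication handler is meant to absorb.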

Slide 8

Limitations of Existing Runtimes

  • Lack of fine-grain scheduling on GPU
    – Many runtime systems do not provide an effective mechanism to co-schedule workloads on a GPU
    – Instead, the programmer must launch kernels on the GPU and hope Hyper-Q will schedule them in a smart way (very challenging in many cases)
  • Lack of other GPU-aware optimizations
    – Few runtimes support direct communication among GPUs
    – Likewise for load balancing on a GPU and among GPUs
  • Lack of performance analysis and modeling capabilities
    – Most runtimes are developed primarily for production runs
    – Performance analysis and modeling are important in hardware & software design exploration

Slide 9

Outline

  • GPU Programming Challenges
  • RambutanAcc Programming and Execution Model
  • Experimental Results
  • Related Project Modeling Areas
  • Conclusion
Slide 10

The RambutanAcc Project

  • Objectives
    – Analyze and model performance behavior of applications under various execution models and communication policies
    – Scale our applications to effectively utilize multiple GPUs
    – Optimize the performance at low programming cost
  • Methodology
    – Extend Rambutan, an asynchronous programming model
      • Represent application code with a task graph (Directed Acyclic Graph)
    – Support GPU execution within the runtime system
    – Analyze the performance behavior

Slide 11

RambutanAcc Task Graphs (DAGs)

[Figures: example task graphs: a Cholesky factorization DAG with F, S, and U tasks; a 3D stencil DAG; a 2.5D Cannon matrix multiply DAG (Shift A, Shift B, Accumulate C); and a CNS/SMC DAG]

Slide 12

Task Spaces and Data Spaces

  • Task Space
    – A task space encapsulates the behavior of a class of tasks
    – Tasks are dynamically created at runtime
    – E.g. task <0, 1> (iter 1) will not be created until task <0, 0> (iter 0) completes
  • Data Space
    – A data space encapsulates access and management of a class of data
    – Data parcels are the granularity of data handled by the runtime
    – A task may require data inputs, each a partition of the data space called a parcel
    – Tasks may produce output parcels on execution
  • Mapping of Task Spaces and Data Spaces
    – A parcel is associated with a locale, indicating where the data resides (CPU or GPU DRAM)
    – The runtime system is responsible for migrating parcels

[Figure: a 3D stencil DAG mapped onto its data space]
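As a loose illustration of parcels and locales, here is a small C++ sketch; the types and names below are invented for this sketch and are not the RambutanAcc API:

#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
enum class Locale { HostDRAM, DeviceDRAM };   // where a parcel currently resides

struct Parcel {
    void*  data;     // one partition of a data space
    size_t bytes;
    Locale locale;   // consulted by the runtime when migrating parcels
};

struct Task {
    std::vector<Parcel*> inputs;    // parcels this task consumes
    std::vector<Parcel*> outputs;   // parcels this task produces on execution
};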

Slide 13

Defining Tasks

  • Three Task Types
    – Type 1: tasks running on the host
    – Type 2: tasks running on the host that offload compute intensive kernels to GPUs
    – Type 3: tasks running on GPUs
  • Specifying inputs/outputs
  • Specifying task computation
  • Specifying post-completion action (e.g. create a new task)

Slide 14

Defining Tasks

Type 2 tasks employ the traditional CUDA non-blocking kernel launch model.

  • Three Task Types
    – Type 1: tasks running on the host
    – Type 2: tasks running on the host that offload compute intensive kernels to GPUs
      • Port legacy CUDA codes quickly
    – Type 3: tasks running on GPUs

void launch(cudaStream_t stream) {
    // non-blocking launch on the given stream (grid/block dims elided)
    kernel1<<<..., 0, stream>>>(arguments);
}
void finish() {
    // post-completion action:
    // create a new task which launches kernel2
}

Slide 15

Defining Tasks

  • Three Task Types
    – Type 1: tasks running on the host
    – Type 2: tasks running on the host that offload compute intensive kernels to GPUs
    – Type 3: tasks running on GPUs
      • Low launching overhead
      • Run tasks on individual SMs

__device__ void myKernel(void* sArgs, void* appArgs) {
    // sArgs contains information on threadIdx, blockIdx, dimBlock and dimGrid
    // compute
}
void finish() {
    // post-completion action: create a new task
}

[Figure: a Cholesky factorization DAG with tasks U(3,2,3) and U(2,0,2) each mapped to its own CUDA grid]

Slide 16

Implementation

  • Task management system
    – Tasks are created at runtime
    – Existing tasks issue requests to create new tasks
  • Task scheduler
    – Depending on type, tasks are scheduled on the host or on a GPU
    – Tasks are buffered to reduce scheduling overhead
  • Communication handler
    – Handles all types of communication (host-host, host-GPU, GPU-GPU)
    – Operates in an asynchronous fashion

[Figure: the RambutanAcc runtime system: the task scheduler receives requests to create new tasks, routes them into host and accelerator ready queues, feeds host and accelerator workers through per-worker task buffers, updates worker status and commits tasks, and drives a communication handler]
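The routing step can be pictured with a small self-contained C++ fragment; the queue and type names are invented for illustration, not the runtime's actual structures:

#include <queue>

enum class Where { Host, Acc };
struct Task { Where where; /* inputs, kernel, ... */ };

// Route newly created tasks to the ready queue matching their type.
// Workers then drain their queues through per-worker task buffers,
// which amortizes per-task scheduling overhead.
void routeCreatedTasks(std::queue<Task*>& created,
                       std::queue<Task*>& hostReady,
                       std::queue<Task*>& accReady)
{
    while (!created.empty()) {
        Task* t = created.front();
        created.pop();
        (t->where == Where::Acc ? accReady : hostReady).push(t);
    }
}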

Slide 17

Type-3 Tasks: Persistent Kernel

  • Initially, we launch a persistent CUDA kernel and keep only a few thread blocks per SM
  • The task scheduler running on the host sends tasks to a task buffer on the GPU using cudaMemcpyAsync
  • After servicing a task, this kernel notifies the task scheduler on the host using UVM

[Figure: persistent kernel servicing tasks on the GPU: a CUDA thread grid whose thread blocks are grouped into Workers 0-3, mapped to SMs 0-1, 2-3, 4-5, and 6-7]
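A minimal sketch of this persistent-kernel pattern follows; the TaskSlot layout and flag protocol are assumptions for illustration (a production version also needs memory fences), not the actual RambutanAcc implementation:

struct TaskSlot {
    volatile int ready;   // set after the host copies a task in via cudaMemcpyAsync
    volatile int done;    // set by the device; the host observes it through UVM
    int kind;             // selects which computation to run
    /* task arguments ... */
};

__device__ void runTask(TaskSlot* t) { /* dispatch on t->kind */ }

// One thread block acts as one worker and owns one slot; the kernel stays
// resident, polling its slot for work until the host signals shutdown.
__global__ void persistentKernel(TaskSlot* slots, volatile int* shutdown)
{
    TaskSlot* slot = &slots[blockIdx.x];
    while (*shutdown == 0) {
        if (slot->ready == 1) {
            runTask(slot);            // all threads of the worker cooperate
            __syncthreads();
            if (threadIdx.x == 0) {
                slot->ready = 0;      // free the slot for the next task
                slot->done  = 1;      // notify the host-side scheduler
            }
            __syncthreads();
        }
    }
}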

Slide 18

Communication Handler

  • We use GASNet to handle communication among GPUs
  • Data can be routed through hosts or transferred directly among GPUs, depending on hardware support

[Figure: host-routed path: a requesting remote procedure call, cudaMemcpyAsync from the source GPU to its host, cudaStreamQuery polling, gasnet_put_nb between hosts, then a responding remote procedure call and cudaMemcpyAsync into the destination GPU]

[Figure: direct path: (1) a remote procedure call requests a parcel, (2) gasnet_put_nb sends the data location, (3) DMA moves the parcel directly between GPUs, (4) the owner is notified]
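The host-routed path can be sketched with GASNet-1 style calls; the pinned staging buffers and inline waiting below are simplifying assumptions, since the actual handler overlaps these steps with other runtime work:

#include <cuda_runtime.h>
#include <gasnet.h>

// Stage a parcel from GPU memory through pinned host memory, then push it
// into a pre-registered buffer on the remote host with a one-sided put.
void sendParcelViaHost(gasnet_node_t peer, const void* devSrc,
                       void* hostStage, void* remoteHostBuf,
                       size_t bytes, cudaStream_t s)
{
    cudaMemcpyAsync(hostStage, devSrc, bytes, cudaMemcpyDeviceToHost, s);

    // Poll rather than block, so the handler can service other events.
    while (cudaStreamQuery(s) == cudaErrorNotReady) { /* make progress */ }

    gasnet_handle_t h = gasnet_put_nb(peer, remoteHostBuf, hostStage, bytes);
    gasnet_wait_syncnb(h);   // the remote side then copies host-to-GPU
}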

Slide 19

Outline

  • GPU Programming Challenges
  • RambutanAcc Programming and Execution Model
  • Experimental Results
  • Related Project Modeling Areas
  • Conclusion
Slide 20

Studying Task Scheduling & Data Comm. Optimizations

Applications and their characteristics:

  • Sparse Cholesky Matrix Factorization: an irregular algorithm that requires small tasks to balance load across processors/GPUs; however, it is challenging to scale fine-grain tasks on high-end GPUs
  • 3D Jacobi Iterative Solver: tasks have the same size due to structured grids, and task performance is bounded by memory bandwidth; it is also challenging to scale this application to many GPUs due to high communication costs
  • Dense Matrix Multiply: tasks are compute intensive, but GPUs can process them quickly, so overall performance is also sensitive to communication costs

Slide 21

Sparse Cholesky Matrix Factorization

  • Each matrix is represented as a sparse list of small tiles
  • The smaller the tile size, the harder it is to scale computations to all the cores of a GPU
  • With the persistent kernel (Type 3 tasks), we can schedule each computation task on a subset of the available SMs
  • Co-scheduling tasks on the same GPU improves performance substantially

[Chart: runtime in milliseconds vs matrix size (128, 256, 512), comparing Type 2: CUDA Launch and Type 3: Persistent Kernel]

Slide 22

3D Jacobi Iterative Solver

  • Strong scaling study with grid size 512^3
  • Asynchronous code variants scale better since they can tolerate communication cost via communication overlap
  • On 6 GPUs, communication overlap without DMA becomes less effective due to lack of computation
  • With DMA, the asynchronous code performs well in all cases

[Chart: GFLOP/s vs number of GPUs (1, 2, 4, 6) for Bulk Synchronous, Asynchronous, Asynchronous+DMA, and Free Comm (Theoretical)]

Slide 23

Dense Matrix Multiply

  • 2.5D Cannon (communication-avoiding) algorithm on 4 GPUs
  • Communication overhead is high since dense matrix multiply ops can be processed by GPUs quickly
  • Thus, the asynchronous code variants work well only with DMA
  • Communication avoidance improves the performance further

[Chart: GFLOP/s vs matrix size (1024, 2048, 4096) for Synchronous, Asynchronous-2D, Asynchronous-2D+DMA, Asynchronous-2.5D+DMA, and Free Comm (Theoretical)]

Slide 24

Outline

  • GPU Programming Challenges
  • RambutanAcc Programming and Execution Model
  • Experimental Results
  • Related Project Modeling Areas
  • Conclusion
Slide 25

Other Rambutan Modeling Activities

  • Task Placement and Load Balancing Strategies
    – Dynamic and semi-static load balancers
      • Asynchronous task execution models need intelligent schedulers
    – Topology-aware mapping algorithms
      • Application communication modeling
      • Network topology modeling
  • Asynchrony Sensitivity
    – Estimate the benefit of asynchronous execution
      • Use a task graph representation of the algorithm
      • Experiment with varying degrees of asynchrony
  • Hardware Co-Design
    – Estimate the impact of hardware parameters on application performance
      • Node level (CPU, GPU, memory) and machine level (NIC, switches and links, topology)
      • For various applications and execution models
Slide 26

AMR Modeling and Simulation for Exascale

[Figure: the ExaSAT framework: input code passes through a ROSE frontend to an AST; compiler analysis (function and loop attribute analysis; arithmetic ops, read/write, and stencil access analysis) produces a code description (XML); combined with user parameters and an exascale machine config (XML), the performance model (working set and memory traffic modeling, loop dependency and memory footprint analysis) yields a performance spreadsheet and a dependency graph]
  • Need models to estimate performance of AMR codes on future architectures for co-design
  • Many interacting design choices impact performance:
    – Solver algorithms (e.g. multigrid)
    – AMR mesh/box layout
    – Network interconnect topology
    – Execution model (e.g. degree of synchronization)
  • Focus on performance at the level of distributed solvers
  • Composition of on-node and inter-node performance models

[Figure: Aries dragonfly interconnect topology; ExaSAT on-node performance model; AMR task dependency graph]

[Figure: AMR modeling workflow: the BoxLib AMR library supplies domain problem parameters and a box hierarchy (e.g. box 0: ((0,0,0) (15,31,15)), 16x32x16); ExaSAT and ProgrAMR generate task dependency graphs (XML); the Mota topology-aware mapping library performs topology-aware box placement; the SST/macro network simulator produces estimated execution performance]

Slide 27

Impact of Topology-Aware Mappings: Reduced Network Hop Traversals

[Chart: 3DT-1536 hop metrics (avg. hops/byte and avg. hops/msg, normalized) for topology-aware vs non-topology-aware mappings (rdm, rr, ks, sfcs, pfcm, gr, rcm, rb); Kiviat plot, 12k nodes, 10 TB/s DRAM, dragonfly, 1.5k ranks]

  • A suite of tools in the Rambutan Project explores topology-aware data and task placement for AMR simulation codes and their interactions with asynchronous execution models:
    – ExaSAT (on-node performance modeling), ProgrAMR (asynchronous task graph generation), and Mota Mapper (topology mapping algorithms)
  • Topology-aware placement improves performance by up to 10% compared to non-topology-aware algorithms for balanced scenarios, and by up to 55% for more extreme scenarios
  • Greater opportunity for improvement for larger diameter networks and hop-latency bound interconnects
  • We plan to extend our model to include intra-node memory regions (e.g. NUMA domains, GPU memory, HBM, etc.)
  • We expect our techniques to generalize to other domains

[Figure: example topology mappings: Z-Morton 3D SFC and Reverse Cuthill-McKee]
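To illustrate the Z-Morton 3D SFC mapping named above, here is a standard Morton-key helper (an illustrative sketch, not the Mota Mapper implementation); sorting boxes by this key and assigning consecutive key ranges to consecutive nodes keeps spatially nearby boxes on nearby ranks:

#include <cstdint>

// Spread the low 10 bits of x so two zero bits separate each bit.
static uint32_t spreadBits3D(uint32_t x) {
    x &= 0x3ff;
    x = (x | (x << 16)) & 0x030000ff;
    x = (x | (x <<  8)) & 0x0300f00f;
    x = (x | (x <<  4)) & 0x030c30c3;
    x = (x | (x <<  2)) & 0x09249249;
    return x;
}

// Z-order (Morton) key for a 3D box index: interleave x, y, z bits.
uint32_t mortonKey3D(uint32_t x, uint32_t y, uint32_t z) {
    return spreadBits3D(x) | (spreadBits3D(y) << 1) | (spreadBits3D(z) << 2);
}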

Slide 28

Modeling the Impact of System Irregularity

  • Model the impact of algorithm and system irregularity on application performance
  • System noise can cause imbalance even for highly regular algorithms
  • Goal: quantify the irregularity inherent in algorithms and the potential benefit of alternative execution models
  • Investigating irregularity metrics, noise sensitivity, and hardware specialization

[Figure: task-based DAG execution model]

Slide 29

Conclusion

  • We have presented a software framework that makes it easy to study the impact of optimization techniques, as well as to model and analyze the performance behavior of applications
  • Interesting findings:
    – A persistent CUDA kernel can help an application scale to the many cores of a GPU without complicating application kernel code
    – Once written in DAG form, an application can tolerate communication costs by overlapping them with computation
    – Direct memory access and communication-avoiding techniques help improve performance and scalability

Slide 30

Acknowledgement

  • This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
  • RambutanAcc and application codes were developed and evaluated on TiDA, a system housed at LBNL, with GPUs provided by NVIDIA.