1
Performance Analysis, Modeling and Optimization of CUDA Codes with the RambutanAcc DAG-based Framework
Tan Nguyen, John Bachan, Samuel Williams, David Donofrio, John Shalf, Cy Chan
Lawrence Berkeley National Laboratory
GPU Technology Conference - May 8, 2017
2
PERFORMANCE ANALYSIS, MODELING AND OPTIMIZATION OF CUDA CODES WITH A DAG-BASED FRAMEWORK
This talk presents techniques for scaling CUDA applications to the many cores of a single GPU, as well as to many GPUs of a cluster. By representing the application as a DAG (directed acyclic graph), as introduced by our RambutanAcc framework, the programmer can improve application performance and scalability with a fine-grained task scheduler and a lightweight communication handler implemented in a runtime system. The programmer can also gain insight into application behavior by using the analysis and modeling tools embedded in the framework.
Lawrence Berkeley National Laboratory, Computational Research Division
3
4
5
– Tesla C1060 (240 cores), Fermi C2050 (448 cores)
– Kepler K20 (2496 cores), K40 (2880 cores), K80 (2x 2496 cores)
– Pascal P100 (3584 cores)
6
– Maximizing performance at low programmer effort
– GPU provides higher performance but less (on-device) memory
– Host CPU has lower performance but more memory
– Manage data across multiple nodes
– Need runtime support for data management
[Figure: a node with CPU and DRAM connected to GPU DRAM over PCIe]
7
[Figure: multiple nodes, each with CPU DRAM and several GPU DRAMs, connected by an interconnect; Direct Memory Access (DMA) within a node and RDMA across nodes]
– Asynchronous communication
– Direct communication among GPUs
8
– Many runtime systems do not provide an effective mechanism to co-schedule workloads on a GPU
– Instead, the programmer must launch kernels on the GPU and hope that Hyper-Q will schedule them intelligently (very challenging in many cases)
– Not many runtimes support direct communication among GPUs
– Likewise for load balancing on a GPU and among GPUs
– Most runtimes are developed primarily for production runs
– Performance analysis and modeling are important in hardware & software design exploration
9
10
– Extend Rambutan, an asynchronous programming model
– Support GPU execution within the runtime system
– Analyze the performance behavior
11
[Figures: Cholesky Factorization DAG (factorize F, solve S, update U tasks), 3D Stencil DAG, 2.5D Cannon Matrix Multiply DAG (Shift A, Shift B, Accumulate C), CNS/SMC DAG]
12
– A task space encapsulates the behavior of a class of tasks
– Tasks are created dynamically at runtime
– E.g., task <0, 1> (iteration 1) will not be created until task <0, 0> (iteration 0) completes; see the sketch after the figure below
– A data space encapsulates access to and management of a class of data
– Data parcels are the granularity of data handled by the runtime
– A task may require data inputs, each a partition of the data space called a parcel
– Tasks may produce output parcels on execution
– A parcel is associated with a locale, indicating where the data resides (CPU or GPU DRAM)
– The runtime system is responsible for migrating parcels
[Figure: 3D Stencil DAG]
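To make the dynamic-creation rule concrete, here is a minimal host-side sketch in which completing task <0, iter> creates task <0, iter+1>. The TaskKey, create_task, and run_task names are hypothetical; the slides do not show RambutanAcc's host API.

#include <cstdio>
#include <queue>

struct TaskKey { int block, iter; };            // index into the task space
static const int MAX_ITER = 3;
static std::queue<TaskKey> ready;               // stand-in for the runtime's ready queue

static void create_task(TaskKey k) { ready.push(k); }

static void run_task(TaskKey k) {
    std::printf("running task <%d, %d>\n", k.block, k.iter);
    // Post-completion action: dynamically create the next task in the space.
    if (k.iter + 1 < MAX_ITER) create_task({k.block, k.iter + 1});
}

int main() {
    create_task({0, 0});                        // seed the task space
    while (!ready.empty()) { TaskKey k = ready.front(); ready.pop(); run_task(k); }
    return 0;
}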
13
Type 1: Tasks running on host
Type 2: Tasks running on host, offloading compute-intensive kernels to GPUs
Type 3: Tasks running on GPUs
Tasks may specify post-completion actions (e.g. create a new task)
14
Type 2 tasks employ the traditional CUDA non-blocking kernel launch model
Type 1: Tasks running on host
Type 2: Tasks running on host and GPUs
– Port legacy CUDA codes quickly
Type 3: Tasks running on GPUs
void launch(cudaStream_t stream) {
  // launch the kernel asynchronously on the given stream
  kernel1<<<dimGrid, dimBlock, 0, stream>>>(arguments);
}

void finish() {
  // post-completion action: create a new task which launches kernel2
}
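Expanded into a self-contained sketch, a Type 2 task boils down to standard CUDA stream usage. kernel1 and the polling loop below are illustrative assumptions; the real runtime would interleave other tasks while waiting.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel1(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;                    // placeholder compute
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    kernel1<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);   // non-blocking launch

    // The host worker stays free while the kernel runs; the runtime polls for
    // completion and then runs the post-completion action, e.g. creating the
    // task that launches kernel2.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) { /* run other tasks */ }

    std::printf("kernel1 done\n");
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}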
15
Type 1: Tasks running on host
Type 2: Tasks running on host and GPUs
Type 3: Tasks running on GPUs
– Low launching overhead
– Run tasks on individual SMs (sketched after the figure below)
__device__ void myKernel(void* sArgs, void* appArgs) {
  // sArgs contains information of threadIdx, blockIdx, dimBlock and dimGrid
  // compute
}

void finish() {
  // post-completion action: create a new task
}
[Figure: Cholesky factorization DAG; Type 3 tasks such as U(3,2,3) and U(2,0,2) each run on their own logical grid]
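The slides only say that sArgs carries logical threadIdx, blockIdx, dimBlock and dimGrid. The SchedulerArgs layout below is an assumption, sketching how a Type 3 task might recover a global index when it runs inside a persistent kernel rather than in its own launch.

#include <cuda_runtime.h>

// Hypothetical layout of the scheduler-provided arguments of a Type 3 task.
// The runtime must virtualize thread/block indices because the task runs on
// a subset of SMs, not in a dedicated kernel launch.
struct SchedulerArgs {
    uint3 logicalThreadIdx, logicalBlockIdx;   // indices within the task's grid
    dim3  dimBlock, dimGrid;                   // logical block/grid dimensions
};

__device__ void myTask(void* sArgs, void* appArgs) {
    SchedulerArgs* s = static_cast<SchedulerArgs*>(sArgs);
    float* data = static_cast<float*>(appArgs);
    // Compute a logical global index from the virtualized coordinates.
    int i = s->logicalBlockIdx.x * s->dimBlock.x + s->logicalThreadIdx.x;
    data[i] += 1.0f;                           // placeholder compute
}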
16
– Tasks are created at runtime
– Existing tasks issue requests to create new tasks
– Depending on its type, a task is scheduled on the host or a GPU (see the sketch after the figure below)
– Tasks are buffered to reduce scheduling overhead
– The communication handler deals with all types of communication (host-host, host-GPU, GPU-GPU)
– Communication proceeds in an asynchronous fashion
[Figure: RambutanAcc runtime system. A task scheduler feeds host and accelerator ready queues; a fetching queue and per-worker task buffers supply host workers and accelerator workers; workers update their status and commit tasks; the communication handler creates new tasks from incoming requests.]
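A minimal sketch of the routing just described, assuming simple std::queue ready queues; the actual RambutanAcc scheduler, with task buffering and worker commit, is more involved.

#include <queue>

enum class TaskType { Host, HostOffload, Gpu };  // Types 1, 2 and 3
struct Task { TaskType type; /* kernel, parcels, ... */ };

static std::queue<Task> host_ready, acc_ready;

// Depending on its type, a task goes to the host or accelerator ready queue;
// workers then drain per-worker buffers filled from these queues in batches,
// amortizing scheduling overhead over several tasks.
static void schedule(const Task& t) {
    if (t.type == TaskType::Gpu) acc_ready.push(t);
    else                         host_ready.push(t);
}

int main() {
    schedule({TaskType::Host});
    schedule({TaskType::Gpu});
    return 0;
}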
17
[Figure: a CUDA thread grid partitioned into workers of multiple thread blocks (TB): Worker 0 on SMs 0-1, Worker 1 on SMs 2-3, Worker 2 on SMs 4-5, Worker 3 on SMs 6-7]
Persistent kernel servicing tasks on GPU
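A persistent kernel in this style can be sketched as follows: a fixed number of resident blocks loop, claiming task ids from a global counter until the work runs out. The task body and the worker-to-SM assignment are placeholders; RambutanAcc's actual worker layout (figure above) is more elaborate.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void persistentWorker(int* taskCounter, int numTasks, float* data) {
    __shared__ int task;
    for (;;) {
        if (threadIdx.x == 0)
            task = atomicAdd(taskCounter, 1);      // block leader claims a task
        __syncthreads();
        if (task >= numTasks) return;              // no work left: exit kernel
        int i = task * blockDim.x + threadIdx.x;   // task-local indexing
        data[i] += 1.0f;                           // placeholder task body
        __syncthreads();                           // done before leader reclaims
    }
}

int main() {
    const int numTasks = 64, blockSize = 128;
    float* d;
    cudaMalloc(&d, numTasks * blockSize * sizeof(float));
    cudaMemset(d, 0, numTasks * blockSize * sizeof(float));
    int* ctr;
    cudaMalloc(&ctr, sizeof(int));
    cudaMemset(ctr, 0, sizeof(int));
    persistentWorker<<<8, blockSize>>>(ctr, numTasks, d);  // 8 resident blocks
    cudaDeviceSynchronize();
    std::printf("all tasks serviced by the persistent kernel\n");
    cudaFree(d);
    cudaFree(ctr);
    return 0;
}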
18
[Figure: host-staged communication between GPUs on different nodes: a requesting remote procedure call, cudaStreamQuery polling, cudaMemcpyAsync from GPU to host, gasnet_put_nb across the interconnect, a responding remote procedure call, and cudaMemcpyAsync from host to the destination GPU]
[Figure: direct (DMA) path: (1) remote procedure call to request a parcel, (2) gasnet_put_nb sends the data location, (3) DMA transfer between GPUs, (4) notify]
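The host-staged path can be sketched as below, assuming GASNet is initialized elsewhere, h_stage is pinned host memory, and peer_host_buf lies in the destination node's registered segment; all names are illustrative and error handling is omitted.

#include <cuda_runtime.h>
#include <gasnet.h>

void send_parcel(float* d_src, float* h_stage,          // device source, pinned staging
                 gasnet_node_t peer, void* peer_host_buf,
                 size_t bytes, cudaStream_t stream) {
    // (1) Stage the parcel from GPU DRAM into pinned host memory.
    cudaMemcpyAsync(h_stage, d_src, bytes, cudaMemcpyDeviceToHost, stream);

    // (2) Poll so the host worker can keep servicing other tasks meanwhile.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) { /* run other tasks */ }

    // (3) One-sided, non-blocking put across the interconnect.
    gasnet_handle_t h = gasnet_put_nb(peer, peer_host_buf, h_stage, bytes);
    gasnet_wait_syncnb(h);                              // or test later, asynchronously

    // On the responding side, a cudaMemcpyAsync host-to-device completes the
    // transfer into the destination GPU's DRAM.
}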
19
20
Sparse Cholesky Matrix Factorization: an irregular algorithm, requiring small tasks to balance work across processors/GPUs. However, it is challenging to scale fine-grained tasks on high-end GPUs.
3D Jacobi Iterative Solver: tasks have the same size due to the structured grid, and task performance is bounded by memory bandwidth. It is also challenging to scale this application to many GPUs due to high communication costs.
Dense Matrix Multiply: tasks are compute intensive, but GPUs process them quickly, so overall performance is also sensitive to communication costs.
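For intuition on why the Jacobi tasks are bandwidth-bound, consider a 7-point sweep: each updated point reads 7 values and writes 1, up to 32 bytes of memory traffic for about 7 flops (less with cache reuse). A plain, untiled kernel sketch (names and launch shape are illustrative, not RambutanAcc's):

__global__ void jacobi7(const float* in, float* out, int nx, int ny, int nz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i <= 0 || j <= 0 || k <= 0 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
        return;                                   // skip boundary points
    long idx = (long)k * nx * ny + (long)j * nx + i;
    // One write per 7 reads: arithmetic intensity stays far below the
    // flop/byte ratio of the GPU, so DRAM bandwidth bounds performance.
    out[idx] = (in[idx - 1] + in[idx + 1] +
                in[idx - nx] + in[idx + nx] +
                in[idx - (long)nx * ny] + in[idx + (long)nx * ny] +
                in[idx]) * (1.0f / 7.0f);
}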
21
[Figure: task time in milliseconds vs. matrix size (128, 256, 512), comparing Type 2 (CUDA launch) with Type 3 (persistent kernel)]
22
[Figure: 3D stencil performance in GFLOP/s vs. number of GPUs (1, 2, 4, 6): Bulk Synchronous, Asynchronous, Asynchronous+DMA, and the Free Comm (theoretical) bound]
23
[Figure: matrix multiply performance in GFLOP/s vs. matrix size (1024, 2048, 4096): Synchronous, Asynchronous-2D, Asynchronous-2D+DMA, Asynchronous-2.5D+DMA, and the Free Comm (theoretical) bound]
24
25
– Dynamic and semi-static load balancers
– Topology-aware mapping algorithms
– Estimate the benefit of asynchronous execution
– Estimate the impact of hardware parameters on application performance (e.g. network topology)
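One simple way to produce such estimates is an analytic model: bound each task by the larger of its compute and memory time, then add an alpha-beta term for communication. The sketch below uses placeholder parameter values, not measured numbers from the talk.

#include <algorithm>
#include <cstdio>

// Estimated task time = max(compute, memory) + communication (alpha-beta).
double task_time(double flops, double bytes, double msgs, double msg_bytes,
                 double peak_flops, double mem_bw, double alpha, double beta) {
    double compute = flops / peak_flops;               // seconds of pure compute
    double memory  = bytes / mem_bw;                    // seconds of DRAM traffic
    double comm    = msgs * alpha + msg_bytes * beta;   // latency + bandwidth terms
    return std::max(compute, memory) + comm;            // compute/memory overlap
}

int main() {
    // Hypothetical stencil task on a hypothetical GPU and network.
    double t = task_time(/*flops*/ 1.2e8, /*bytes*/ 1.3e8, /*msgs*/ 6,
                         /*msg_bytes*/ 1.5e6, /*peak*/ 4.3e12, /*bw*/ 2.4e11,
                         /*alpha*/ 2e-6, /*beta*/ 1.0 / 6e9);
    std::printf("estimated task time: %.3e s\n", t);
    return 0;
}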
26
[Figure: AMR Modeling Workflow. Input code flows through the ExaSAT framework (ROSE frontend, AST, code description <XML>); together with user parameters and an exascale machine config <XML>, it yields a performance spreadsheet and a dependency graph. The BoxLib AMR library, domain problem parameters, and a box hierarchy (a list of box extents, e.g. 0: ((0,0,0) (15,31,15)) 16 32 16, ...) feed ExaSAT and ProgrAMR task graph generation, which emits a task dependency graph (XML). The Mota topology-aware mapping library and the SST/macro network simulator, with Aries dragonfly interconnect topology models and the ExaSAT on-node performance model, produce a topology-aware placement and estimated execution performance for future architectures (co-design).]
27
[Figure: 3DT-1536 hop metrics for the mapping algorithms rdm, rr, ks, sfcs, pfcm, gr, rcm, and rb, topology-aware vs. non-topology-aware; Kiviat machine model, 12k nodes, 10 TB/s DRAM, Dragonfly network, 1.5k ranks (normalized)]
– We model task placement for AMR simulation codes and its interactions with asynchronous execution models:
– ExaSAT (on-node performance modeling), ProgrAMR (asynchronous task graph generation), and Mota Mapper (topology mapping algorithms)
– Topology-aware mapping improves on non-topology-aware algorithms for balanced scenarios, and by up to 55% for more extreme scenarios and hop-latency bound interconnects
– Mapping can target distinct memory regions (e.g. NUMA domains, GPU memory, HBM, etc.)
[Figure: example topology mappings: Z-Morton 3D SFC and Reverse Cuthill-McKee]
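For reference, the Z-Morton ordering named above interleaves the coordinate bits of each box so that nearby boxes receive nearby linear ids. A standard encoder sketch follows; Mota's actual implementation may differ.

#include <cstdint>
#include <cstdio>

// Insert two zero bits after each of the low 21 bits of v.
static uint64_t spread3(uint64_t v) {
    v &= 0x1fffff;
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v <<  8) & 0x100f00f00f00f00fULL;
    v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}

// Interleave x, y, z bits into a single Z-Morton key.
uint64_t morton3(uint32_t x, uint32_t y, uint32_t z) {
    return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}

int main() {
    std::printf("morton3(1,2,3) = %llu\n",
                (unsigned long long)morton3(1, 2, 3));
    return 0;
}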
28
– The impact of irregularity on application performance
– The benefit of asynchronous execution, even for highly regular algorithms
– The limits of current algorithms and the potential benefit of alternative execution models
– Communication sensitivity, and hardware specialization
Task-based DAG Execution Model
29
– A persistent CUDA kernel can help an application scale to the many cores of a GPU without complicating application kernel code
– Once written in DAG form, an application can tolerate communication by overlapping it with computation
– Direct memory access and communication-avoiding techniques help improve performance and scalability
30