[PPT] - GPU Program Optimizations Yixun Liu, Eddy Z. Zhang , Xipeng Shen PowerPoint Presentation

SLIDE 1

A Cross-Input Adaptive Framework for GPU Program Optimizations

Yixun Liu, Eddy Z. Zhang, Xipeng Shen

Computer Science Department, The College of William & Mary

SLIDE 2

Eddy Z. Zhang

Outline

GPU overview
G-Adapt Framework
Evaluation
Related & Future Work
Conclusion

SLIDE 3

GPU (Graphics Processing Unit)

Architecture

▫ SIMD parallel
▫ Multithreaded
▫ Many core

Features

▫ Tremendous computational horsepower
▫ High memory bandwidth

Applications

▫ Traditional graphics rendering
▫ Emerging: general data-parallel computing

SLIDE 4

Programming GPU

▫ High-level model

  ▪ Abstraction to a multithreaded platform
  ▪ C-like programming
  ▪ No explicit mapping to graphics rendering
  ▪ E.g., CUDA, Brook+, OpenCL

▫ NVIDIA CUDA

  ▪ Kernel functions run on the GPU
  ▪ Threads -> blocks -> grids

Figure from the CUDA manual
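The threads -> blocks -> grids hierarchy can be illustrated with a small host-side sketch. It is not CUDA code: it only mimics, in Python, the index arithmetic a one-dimensional kernel would typically use. The function names `blocks_needed` and `thread_ids` are hypothetical and do not come from the slides.

```python
from typing import Iterator, Tuple


def blocks_needed(n_elements: int, block_dim: int) -> int:
    """Number of blocks so that grid_dim * block_dim threads cover n_elements."""
    return (n_elements + block_dim - 1) // block_dim  # ceiling division


def thread_ids(grid_dim: int, block_dim: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (block index, thread index within the block, global thread id)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in a kernel.
            yield block_idx, thread_idx, block_idx * block_dim + thread_idx
```

For 1000 elements and 256-thread blocks, `blocks_needed(1000, 256)` gives 4 blocks, i.e. 1024 threads, so the last 24 threads need a bounds check before touching data.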

SLIDE 5

Optimization Challenges

Goal

▫ Maximize throughput

  ▪ Increase occupancy; reduce latency and dynamic instruction count

Difficulties

▫ Hard to predict optimization effects

  ▪ Non-linearity, coupling, undisclosed CUDA details

▫ GPU hardware complexities

  ▪ Limits: 512 threads per block, 768 threads per SM, etc.
  ▪ Various types of memory: constant, texture, etc.

▫ Input sensitivity
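The non-linearity mentioned above can be seen even in a simplified occupancy calculation. The sketch below uses the two limits from the slide (512 threads per block, 768 threads per SM). The cap of 8 resident blocks per SM is an assumption that is not on the slide, and register and shared-memory pressure are ignored entirely, so real occupancy may be lower.

```python
MAX_THREADS_PER_BLOCK = 512  # limit quoted on the slide
MAX_THREADS_PER_SM = 768     # limit quoted on the slide
MAX_BLOCKS_PER_SM = 8        # assumption: not stated on the slide


def occupancy(block_size: int) -> float:
    """Fraction of an SM's thread slots filled by whole resident blocks."""
    if not 1 <= block_size <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size outside the hardware limit")
    # Only whole blocks can be resident, hence the integer division.
    resident_blocks = min(MAX_THREADS_PER_SM // block_size, MAX_BLOCKS_PER_SM)
    return resident_blocks * block_size / MAX_THREADS_PER_SM


# Occupancy is not monotonic in the block size under these assumptions.
table = {size: round(occupancy(size), 3) for size in (64, 128, 256, 320, 512)}
```

Under these assumptions, block sizes 128 and 256 fill the SM completely, while 64 (capped by the block count) and 512 (only one block fits) both reach two thirds. This is one reason a larger block size does not reliably mean better utilization.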

SLIDE 6

Matrix-Vector Multiplication

SLIDE 7

Outline

GPU Overview
G-Adapt Framework
Evaluation
Related & Future Work
Conclusion

SLIDE 8

G-ADAPT

Empirical search-based optimization

▫ Three obstacles to address

  ▪ Construction of the optimization space
  ▪ Space pruning
  ▪ Cross-input adaptation

SLIDE 9

G-ADAPT: Overview

Source-to-source compiler
Cross-input adaptation
Automatic search & transformations
Easy integration of user knowledge through pragmas

Stage 1: Code with pragmas & input -> Empirical search & data collection -> <Input, best optimizations>
Stage 2: <Input, best optimizations> -> Pattern recognition & code generation -> Optimized input-adaptive GPU program

SLIDE 10

G-ADAPT Pragmas

Supports a programmer-compiler synergy
Covers 2 levels of optimizations

▫ Execution configurations

  ▪ E.g., thread block dimensions

▫ Code transformations

  ▪ E.g., loop tile size, unrolling levels

SLIDE 11

Pragma Examples

#pragma erange 64, 512, 2
#define BLKSZ 256
#pragma lpur_lrange 0, min(BLKSZ, 16), 2
for (i=1; i < BLKSZ; i++) { …… }
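How such ranges might span an optimization space can be sketched as follows. The slides do not define the meaning of the three pragma arguments, so this sketch assumes they are (low, high, step). It further assumes that the block-size range steps by multiplication and the unrolling range steps by addition. The helper names `expand_range` and `build_space` and the parameter names `block_size` and `unroll` are hypothetical.

```python
from itertools import product
from typing import Dict, List


def expand_range(low: int, high: int, step: int, geometric: bool) -> List[int]:
    """Candidate values from low to high; the step semantics are an assumption."""
    if step < 1 or (geometric and (step < 2 or low < 1)):
        raise ValueError("range would never advance")
    values = []
    v = low
    while v <= high:
        values.append(v)
        v = v * step if geometric else v + step
    return values


def build_space(ranges: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Cartesian product of all per-parameter candidate lists."""
    names = list(ranges)
    return [dict(zip(names, combo)) for combo in product(*(ranges[n] for n in names))]


BLKSZ = 256
space = build_space({
    "block_size": expand_range(64, 512, 2, geometric=True),         # assumed: 64, 128, 256, 512
    "unroll": expand_range(0, min(BLKSZ, 16), 2, geometric=False),  # assumed: 0, 2, ..., 16
})
```

Under these assumptions the two pragmas already give 4 x 9 = 36 candidate configurations. The count is multiplied by each further tunable parameter, which is why the space needs pruning.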

SLIDE 12

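Stage I: Search & Collect

Components: G-ADAPT compiler, Performance Calibrator, Perf. DB, Optimization Agent
Data flow: Code with pragmas & inputs -> G-ADAPT compiler -> Optimized GPU code -> Performance Calibrator -> Performance -> Perf. DB and Optimization Agent -> Optimization Parameters -> G-ADAPT compiler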

SLIDE 13

G-ADAPT Compiler

Two functionalities

▫ Recognize the optimization space
▫ Program transformations

Based on Cetus [Purdue Univ]
Source-to-source
GPU extensions
Support for G-ADAPT pragmas

SLIDE 14

Performance Calibrator

Invokes the CUDA compiler and runs the executable
Collects running time and GPU occupancy

SLIDE 15

Optimization Agent

Determines the optimization parameters to try next
Uses hill climbing to overcome the space explosion problem
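A climbing-style search over a parameter grid can be sketched as below. This is a generic greedy neighbour search under stated assumptions, not the agent's actual algorithm. The `measure` callback stands in for one compile-and-run cycle through the compiler and calibrator, the `perf_db` dictionary plays the role of the performance database, and `toy_time` is made-up data with a single optimum.

```python
from typing import Callable, Dict, Sequence, Tuple

Index = Tuple[int, ...]


def hill_climb(axes: Sequence[Sequence[int]],
               measure: Callable[[Tuple[int, ...]], float],
               start: Index) -> Tuple[Tuple[int, ...], float, int]:
    """Greedy descent on running time; returns (best values, time, #measurements)."""
    perf_db: Dict[Index, float] = {}  # each configuration is measured at most once

    def timed(idx: Index) -> float:
        if idx not in perf_db:
            perf_db[idx] = measure(tuple(axis[i] for axis, i in zip(axes, idx)))
        return perf_db[idx]

    current = start
    while True:
        # Neighbours differ from the current point by one step along one axis.
        neighbours = []
        for dim in range(len(axes)):
            for delta in (-1, 1):
                i = current[dim] + delta
                if 0 <= i < len(axes[dim]):
                    neighbours.append(current[:dim] + (i,) + current[dim + 1:])
        best = min(neighbours, key=timed, default=current)
        if timed(best) >= timed(current):
            break  # no neighbour is faster: a local optimum
        current = best
    values = tuple(axis[i] for axis, i in zip(axes, current))
    return values, timed(current), len(perf_db)


def toy_time(params: Tuple[int, ...]) -> float:
    """Hypothetical running time with its minimum at block size 256, unroll 4."""
    block_size, unroll = params
    return abs(block_size - 256) / 64 + abs(unroll - 4)


axes = [[64, 128, 256, 512], list(range(0, 17, 2))]
best_values, best_time, n_measured = hill_climb(axes, toy_time, start=(0, 0))
```

On this toy surface the search reaches the optimum while measuring only part of the 36-point grid. On a surface with several local optima it can stop early, so restarts from different starting points are a common remedy.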

SLIDE 16

G-ADAPT: Overview

Stage 1: Code with pragmas & input -> Empirical search & data collection -> <Input, best optimizations>
Stage 2: <Input, best optimizations> -> Pattern recognition & code generation -> Optimized input-adaptive GPU program

SLIDE 17

Stage II: PR & Code Gen

Pattern recognizer

▫ Recognizes the mapping from input to best parameters
▫ Regression Trees with Least Mean Square

Options for code generator

▫ Multiple versions
▫ JIT compilers
▫ Linker

Data flow: Perf. DB -> Pattern Recognizer -> G-ADAPT Code Generator -> Optimized input-adaptive GPU program
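The idea of learning an input-to-parameter mapping with a regression tree can be sketched as follows. This is a simplified stand-in, not the framework's recognizer. It assumes one numeric input feature, squared-error splits, and a constant prediction per leaf, whereas the actual recognizer may, for example, fit a least-mean-square model in each leaf. The training pairs are hypothetical.

```python
from typing import List, Optional, Tuple


class Node:
    def __init__(self, value: float, threshold: Optional[float] = None,
                 left: "Optional[Node]" = None, right: "Optional[Node]" = None):
        self.value, self.threshold, self.left, self.right = value, threshold, left, right


def sse(ys: List[float]) -> float:
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def fit(samples: List[Tuple[float, float]]) -> Node:
    """Recursively split on the threshold that minimizes the total squared error."""
    samples = sorted(samples)
    ys = [y for _, y in samples]
    value = sum(ys) / len(ys)
    best = None  # (error, threshold, split index)
    for i in range(1, len(samples)):
        if samples[i - 1][0] == samples[i][0]:
            continue  # cannot separate equal feature values
        err = sse(ys[:i]) + sse(ys[i:])
        if best is None or err < best[0]:
            best = (err, (samples[i - 1][0] + samples[i][0]) / 2, i)
    if best is None or best[0] >= sse(ys):
        return Node(value)  # leaf: no split reduces the error
    _, threshold, i = best
    return Node(value, threshold, fit(samples[:i]), fit(samples[i:]))


def predict(node: Node, x: float) -> float:
    while node.threshold is not None:
        node = node.left if x <= node.threshold else node.right
    return node.value


# Hypothetical <input size, best block size> pairs, as Stage I might record them.
training = [(128, 64), (256, 64), (512, 128), (1024, 128), (2048, 256), (4096, 256)]
tree = fit(training)
```

With the "multiple versions" option, a generated program could call something like `predict(tree, n)` at launch time and dispatch to the kernel variant compiled for that parameter value. For an unseen input size of 700, this toy tree selects block size 128.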

SLIDE 18

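G-ADAPT

Stage I components: G-ADAPT compiler, Performance Calibrator, Perf. DB, Optimization Agent
Stage I data flow: Code with pragmas & inputs -> G-ADAPT compiler -> Optimized GPU code -> Performance Calibrator -> Performance -> Perf. DB and Optimization Agent -> Optimization Parameters -> G-ADAPT compiler
Stage II data flow: Perf. DB -> Pattern Recognizer -> G-ADAPT Code Generator -> Final input-adaptive GPU program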

SLIDE 19

Outline

GPU overview
G-Adapt Framework
Evaluation
Related & Future Work
Conclusion

SLIDE 20

Evaluation - Platform

GPU: NVIDIA GeForce 8800 GT

▫ 14 multiprocessors (MP), 112 cores
▫ 512 MB global memory, 16 KB shared memory/MP, 8192 registers/MP
▫ CUDA 2.0

Host: Intel Xeon 3.6 GHz, SUSE Linux 2.6.22

SLIDE 21

Benchmarks

Benchmark      Description                                         # of Inputs
convolution    Convolution filter of a 2D signal                   10
matrixMul      Dense matrix multiplication                         9
mvMul          Dense matrix vector multiplication (by Fujimoto)    15
reduction      Sum of an array                                     15
scalarProd     Scalar products                                     7
transpose      Matrix transpose                                    18
transpose-co   Coalescing matrix transpose                         18

SLIDE 22

Training and Prediction

Benchmark      Training iterations   Training time (s)   Prediction accuracy
convolution    200                   2825                100%
matrixMul      196                   2539                100%
mvMul          124                   124                 93.3%
reduction      75                    29                  80%
scalarProd     93                    237                 100%
transpose      54                    1639                100%
transpose-co   54                    631                 100%

SLIDE 23

Matrix Vector Multiplication

Figures: best parameter vs. input; speedup vs. input

SLIDE 24

Speedup over default

SLIDE 25

Speedup over default

SLIDE 26

Speedup over default

http://www.cs.wm.edu/~xshen/Publications/ipdps09.pdf

SLIDE 27

Outline

GPU overview
G-Adapt Framework
Evaluation
Related & Future Work
Conclusion

SLIDE 28

Related Work

Ryoo+: CGO'08

▫ Efficiency and utilization model for search
▫ Manual transformation; assumptions on applications

Baskaran+: ICS'08

▫ Polyhedral model for optimizing memory access
▫ Limited to affine loop nests

Features of G-Adapt

▫ First generally applicable framework
▫ Cross-input adaptation

SLIDE 29

Future Work

More optimization options

▫ Algorithm selection
▫ Memory optimization
▫ Divergence elimination

General support

▫ Cetus – ANSI C compiler

  ▪ Non-ANSI C features, C++

▫ CUDA built-in types

  ▪ E.g., float4, texture, etc.

SLIDE 30

Outline

GPU overview
G-Adapt Framework
Evaluation
Related & Future Work
Conclusion

SLIDE 31

Conclusion

A general tool for GPU optimization
Cross-input adaptation
Synergy between compilers and programmers
Alternative to manual tuning, enabling easy adaptation across architectures

SLIDE 32

Acknowledgement

Cetus authors at Purdue

▫ Group led by Eigenmann and Midkiff

John Owens
NVIDIA

▫ Donation of device

NSF grants

SLIDE 33

Thank you!
