GPU Program Optimizations. Yixun Liu, Eddy Z. Zhang, Xipeng Shen

SLIDE 1

A Cross-Input Adaptive Framework for GPU Program Optimizations

Yixun Liu, Eddy Z. Zhang, Xipeng Shen

Computer Science Department The College of William & Mary

SLIDE 2

Eddy Z. Zhang

Outline

  • GPU overview
  • G-Adapt Framework
  • Evaluation
  • Related & Future Work
  • Conclusion

SLIDE 3

GPU (Graphics Processing Unit)

  • Architecture
    ▫ SIMD parallel
    ▫ Multithreaded
    ▫ Many-core
  • Features
    ▫ Tremendous computational horsepower
    ▫ High memory bandwidth
  • Applications
    ▫ Traditional graphics rendering
    ▫ Emerging: general data-parallel computing

SLIDE 4

Programming GPU

▫ High-level model
  ▪ Abstraction of a multithreaded platform
  ▪ C-like programming
  ▪ No explicit mapping to graphics rendering
  ▪ E.g., CUDA, Brook+, OpenCL
▫ NVIDIA CUDA
  ▪ Kernel functions run on the GPU
  ▪ Threads -> blocks -> grids

[Figure: graph from the CUDA manual]
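To make the threads -> blocks -> grids hierarchy concrete, the sketch below emulates in plain Python how a one-dimensional CUDA launch gives each thread a global index. The names `global_thread_ids`, `blocks_needed`, `grid_dim`, and `block_dim` are illustrative and do not come from the slides; the formula mirrors the usual CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def global_thread_ids(grid_dim, block_dim):
    """Emulate a 1D CUDA launch: return the global index of every thread.

    grid_dim  -- number of blocks in the grid (gridDim.x in CUDA)
    block_dim -- number of threads per block (blockDim.x in CUDA)
    """
    ids = []
    for block_idx in range(grid_dim):          # blocks within the grid
        for thread_idx in range(block_dim):    # threads within one block
            # Same formula a CUDA kernel would use for its global index.
            ids.append(block_idx * block_dim + thread_idx)
    return ids


def blocks_needed(n, block_dim):
    """Number of blocks required so that at least n threads are launched."""
    return (n + block_dim - 1) // block_dim  # ceiling division
```

For instance, covering 1000 elements with 256-thread blocks takes `blocks_needed(1000, 256)` blocks, which launches 1024 threads, so a kernel written this way would need a bounds check for the surplus threads in the last block.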

SLIDE 5

Optimization Challenges

  • Goal
    ▫ Maximize throughput
      ▪ Increase occupancy; reduce latency and dynamic instruction count
  • Difficulties
    ▫ Optimization effects are hard to predict
      ▪ Non-linearity, coupling, undisclosed CUDA details
    ▫ GPU hardware complexities
      ▪ Limits: 512 threads per block, 768 threads per SM, etc.
      ▪ Various types of memory: constant, texture, etc.
    ▫ Input sensitivity
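The non-linearity listed above can be illustrated with a simplified occupancy estimate. This sketch uses the two limits quoted on the slide (512 threads per block, 768 threads per SM) and assumes, beyond what the slide states, a cap of 8 resident blocks per SM. It ignores register and shared-memory pressure, and the function name `occupancy` is hypothetical.

```python
MAX_THREADS_PER_BLOCK = 512   # from the slide
MAX_THREADS_PER_SM = 768      # from the slide
MAX_BLOCKS_PER_SM = 8         # assumption, not stated on the slide


def occupancy(block_size):
    """Fraction of an SM's thread slots filled by whole resident blocks.

    Simplified model: registers and shared memory are not considered.
    """
    if not 1 <= block_size <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size outside the hardware limit")
    # Only whole blocks can be resident, and at most MAX_BLOCKS_PER_SM of them.
    resident_blocks = min(MAX_THREADS_PER_SM // block_size, MAX_BLOCKS_PER_SM)
    return resident_blocks * block_size / MAX_THREADS_PER_SM
```

Under these assumptions, block sizes of 64 and 512 both fill two thirds of the thread slots while 128 and 256 fill all of them, so the estimate does not vary monotonically with block size.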

SLIDE 6

Matrix-Vector Multiplication

SLIDE 7

Outline

  • GPU Overview
  • G-Adapt Framework
  • Evaluation
  • Related & Future Work
  • Conclusion

SLIDE 8

G-ADAPT

  • Empirical search-based optimization

    ▫ Three obstacles to address
      ▪ Construction of the optimization space
      ▪ Space pruning
      ▪ Cross-input adaptation

SLIDE 9

G-ADAPT: Overview

  • Source-to-source compiler
  • Cross-input adaptation
  • Automatic search & transformations
  • Easy integration of user knowledge through pragmas

[Diagram: Code with pragmas & input -> Stage 1: Empirical search & data collection -> <Input, best optimizations> -> Stage 2: Pattern recognition & code generation -> Optimized input-adaptive GPU program]

SLIDE 10

G-ADAPT Pragmas

  • Supports programmer-compiler synergy
  • Covers two levels of optimization
    ▫ Execution configurations
      ▪ E.g., thread block dimensions
    ▫ Code transformations
      ▪ E.g., loop tile size, unrolling levels

SLIDE 11

Pragma Examples

#pragma erange 64, 512, 2
#define BLKSZ 256

#pragma lpur_lrange 0, min(BLKSZ, 16), 2
for (i = 1; i < BLKSZ; i++) { …… }
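One way to read these pragmas is as compact descriptions of a search space. The sketch below assumes that `erange lo, hi, k` enumerates values by repeated multiplication by `k`, and that `lpur_lrange lo, hi, s` enumerates loop-unrolling levels linearly with step `s`. The slide does not spell out these semantics, so both readings and the helper names `erange`, `lrange`, and `search_space` are assumptions.

```python
def erange(lo, hi, factor):
    """Values lo, lo*factor, lo*factor**2, ... up to and including hi.

    Assumed semantics of the 'erange' pragma; not confirmed by the slide.
    """
    if lo <= 0 or factor <= 1:
        raise ValueError("need lo > 0 and factor > 1")
    values = []
    v = lo
    while v <= hi:
        values.append(v)
        v *= factor
    return values


def lrange(lo, hi, step):
    """Values lo, lo+step, lo+2*step, ... up to and including hi.

    Assumed semantics of the 'lrange' suffix; not confirmed by the slide.
    """
    return list(range(lo, hi + 1, step))


def search_space():
    """All (BLKSZ, unroll level) pairs described by the two example pragmas."""
    space = []
    for blksz in erange(64, 512, 2):
        # The upper unrolling bound depends on BLKSZ, as in min(BLKSZ, 16).
        for unroll in lrange(0, min(blksz, 16), 2):
            space.append((blksz, unroll))
    return space
```

Under this reading the two pragmas describe 4 block sizes times 9 unrolling levels, and every further tunable parameter multiplies that count, which is why an exhaustive search quickly becomes impractical.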

SLIDE 12

Stage I: Search & Collect

[Diagram: the Stage I loop. Components: G-ADAPT compiler, Performance Calibrator, Perf. DB, Optimization Agent. Data passed between them: code with pragmas & inputs, optimized GPU code, performance, optimization parameters.]

SLIDE 13

G-ADAPT Compiler

  • Two functionalities
    ▫ Recognizes the optimization space
    ▫ Performs program transformations
  • Based on Cetus [Purdue Univ.]
  • Source-to-source
  • GPU extensions
  • Supports G-ADAPT pragmas

SLIDE 14

Performance Calibrator

  • Invokes the CUDA compiler and runs the executable
  • Collects running time and GPU occupancy

SLIDE 15

Optimization Agent

  • Determines the optimization parameters to try next
  • Uses hill climbing to overcome the space explosion problem
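A minimal sketch of hill climbing over a discrete parameter grid, assuming each parameter has an ordered list of candidate values and that `measure` returns a running time to minimize. The slides do not describe G-ADAPT's actual neighborhood or stopping rule, so `hill_climb`, `measure`, and the one-step neighborhood used here are illustrative assumptions.

```python
def hill_climb(axes, measure, start=None):
    """Greedy descent over a grid of parameter values.

    axes    -- list of lists; axes[d] holds the ordered candidates for parameter d
    measure -- function mapping a tuple of parameter values to a cost (lower is better)
    start   -- optional tuple of starting indices, one per axis (defaults to all zeros)

    Returns (best_params, best_cost, number_of_configurations_measured).
    """
    pos = tuple(start) if start is not None else (0,) * len(axes)
    cache = {}  # plays the role of a performance database: index tuple -> cost

    def cost(p):
        if p not in cache:
            cache[p] = measure(tuple(axes[d][i] for d, i in enumerate(p)))
        return cache[p]

    best = cost(pos)
    while True:
        # Neighbors differ from the current point by one step along one axis.
        neighbors = []
        for d in range(len(axes)):
            for delta in (-1, 1):
                i = pos[d] + delta
                if 0 <= i < len(axes[d]):
                    neighbors.append(pos[:d] + (i,) + pos[d + 1:])
        if not neighbors:
            break
        candidate = min(neighbors, key=cost)
        if cost(candidate) >= best:
            break  # no neighbor improves: a local optimum has been reached
        pos, best = candidate, cost(candidate)

    params = tuple(axes[d][i] for d, i in enumerate(pos))
    return params, best, len(cache)
```

The search measures only the configurations it visits and their neighbors instead of the whole grid. The trade-off is that it stops at a local optimum, which need not be the global one when the cost surface is irregular.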

SLIDE 16

G-ADAPT: Overview

[Diagram: Code with pragmas & input -> Stage 1: Empirical search & data collection -> <Input, best optimizations> -> Stage 2: Pattern recognition & code generation -> Optimized input-adaptive GPU program]

SLIDE 17

Stage II: Pattern Recognition & Code Generation

  • Pattern recognizer
    ▫ Recognizes the relation between inputs and their best parameters
    ▫ Regression trees with least mean squares
  • Options for the code generator
    ▫ Multiple versions
    ▫ JIT compilers
    ▫ Linker

[Diagram: Perf. DB -> Pattern Recognizer -> G-ADAPT Code Generator -> Optimized input-adaptive GPU program]
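A minimal sketch of the pattern-recognition step, assuming a single numeric input feature (for example a matrix dimension) and a numeric best parameter as the target. It fits a small regression tree by choosing, at each node, the split that minimizes the summed squared error. The names `sse`, `fit_tree`, and `predict` and the depth limit are hypothetical; the slides do not give G-ADAPT's actual features or tree settings.

```python
def sse(ys):
    """Sum of squared errors of ys around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def fit_tree(points, max_depth=3):
    """Fit a 1-D regression tree to a non-empty list of (x, y) pairs.

    A leaf is a float (the mean of its targets); an internal node is a
    tuple (threshold, left_subtree, right_subtree).
    """
    points = sorted(points)
    ys = [y for _, y in points]
    if max_depth == 0 or sse(ys) == 0.0:
        return sum(ys) / len(ys)

    best = None  # (error, threshold, split position)
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between equal x values
        err = sse(ys[:i]) + sse(ys[i:])
        if best is None or err < best[0]:
            threshold = (points[i - 1][0] + points[i][0]) / 2
            best = (err, threshold, i)
    if best is None:
        return sum(ys) / len(ys)

    _, threshold, i = best
    return (threshold,
            fit_tree(points[:i], max_depth - 1),
            fit_tree(points[i:], max_depth - 1))


def predict(tree, x):
    """Walk the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree
```

With the multiple-versions option, a generated program could evaluate such a tree at run time to pick which precompiled kernel variant to launch for a given input; that dispatch step is not shown here.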

SLIDE 18

G-ADAPT

[Diagram: the complete G-ADAPT pipeline. Components: G-ADAPT compiler, Performance Calibrator, Perf. DB, Optimization Agent, Pattern Recognizer, G-ADAPT Code Generator. Data passed between them: code with pragmas & inputs, optimized GPU code, performance, optimization parameters. Output: final input-adaptive GPU program.]

SLIDE 19

Outline

  • GPU overview
  • G-Adapt Framework
  • Evaluation
  • Related & Future Work
  • Conclusion

SLIDE 20

Evaluation - Platform

  • GPU: NVIDIA GeForce 8800 GT
    ▫ 14 multiprocessors (MP), 112 cores
    ▫ 512 MB global memory, 16 KB shared memory per MP, 8192 registers per MP
    ▫ CUDA 2.0
  • Host: Intel Xeon 3.6 GHz, SUSE Linux 2.6.22

SLIDE 21

Benchmarks

Benchmark      Description                                         # of inputs
convolution    Convolution filter of a 2D signal                   10
matrixMul      Dense matrix multiplication                         9
mvMul          Dense matrix-vector multiplication (by Fujimoto)    15
reduction      Sum of an array                                     15
scalarProd     Scalar products                                     7
transpose      Matrix transpose                                    18
Transpose-co   Coalescing matrix transpose                         18

SLIDE 22

Training and Prediction

Benchmark      Training iterations   Training time (s)   Prediction accuracy
convolution    200                   2825                100%
matrixMul      196                   2539                100%
mvMul          124                   124                 93.3%
reduction      75                    29                  80%
scalarProd     93                    237                 100%
transpose      54                    1639                100%
Transpose-co   54                    631                 100%
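As a consistency check on the accuracy column, the sketch below converts the two sub-100% figures into counts of inputs. It assumes that accuracy is the fraction of a benchmark's inputs whose best configuration was predicted correctly, and it uses the input counts from the benchmark table (15 each for mvMul and reduction). That definition of accuracy is an assumption; the slide does not state it.

```python
def correct_predictions(num_inputs, accuracy_percent):
    """Number of correctly predicted inputs implied by a rounded percentage."""
    return round(num_inputs * accuracy_percent / 100)


# Input counts come from the benchmark table; accuracies from the table above.
mvmul_correct = correct_predictions(15, 93.3)      # 15 * 0.933 = 13.995
reduction_correct = correct_predictions(15, 80.0)  # 15 * 0.80  = 12.0
```

Under that assumption, 93.3% corresponds to 14 of 15 inputs for mvMul and 80% to 12 of 15 inputs for reduction.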

SLIDE 23

Matrix Vector Multiplication

[Figures: best parameter vs. input; speedup vs. input]

SLIDE 24

Speed up over default

SLIDE 25

Speed up over default

SLIDE 26

Speed up over default


http://www.cs.wm.edu/~xshen/Publications/ipdps09.pdf

SLIDE 27

Outline

  • GPU overview
  • G-Adapt Framework
  • Evaluation
  • Related & Future Work
  • Conclusion

SLIDE 28

Related Work

  • Ryoo+: CGO’08
    ▫ Efficiency and utilization model for search
    ▫ Manual transformation; assumptions on applications
  • Baskaran+: ICS’08
    ▫ Polyhedral model for optimizing memory access
    ▫ Limited to affine loop nests

Features of G-ADAPT
  • First generally applicable framework
  • Cross-input adaptation

SLIDE 29

Future Work

  • More optimization options
    ▫ Algorithm selection
    ▫ Memory optimization
    ▫ Divergence elimination
  • General support
    ▫ Cetus: ANSI C compiler
      ▪ Non-ANSI C features, C++
    ▫ CUDA built-in types
      ▪ E.g., float4, texture, etc.

SLIDE 30

Outline

  • GPU overview
  • G-Adapt Framework
  • Evaluation
  • Related & Future Work
  • Conclusion

SLIDE 31

Conclusion

  • A general tool for GPU optimization
  • Cross-input adaptation
  • Synergy between compilers and programmers
  • An alternative to manual tuning, enabling easy adaptation across architectures

SLIDE 32

Acknowledgement

  • Cetus authors at Purdue
    ▫ Group led by Eigenmann and Midkiff
  • John Owens
  • NVIDIA
    ▫ Donation of the device
  • NSF grants

SLIDE 33

Thank you!
