Sanjay Rajopadhye, Colorado State University: Class objectives, …

Jan 11, 2024

SLIDE 1

Sanjay Rajopadhye Colorado State University

SLIDE 2

n Class objectives, goals, introduction n CUDA performance tuning (wrap up) n Equational Programming (intro)

SLIDE 3

n Parallel Programming is hard

n “End of the free lunch”

[Sut05]

n Arrival of “manycores” signals the end of

“La-Z-Boy Programming” [Pat06]

Becoming a parallel programming expert will get you a good job But your skills may become obsolete – new machines, new languages, … Parallelism must return to La-Z-Boy programming

[Sut05] Herb Sutter. “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency,” in

Software. Dr. Dobb's Journal, vol. 30, no. 3, 2005.

[Pat06] David Patterson, in keynote talk at the International Workshop on Languages and Compilers For Parallel Computers LCPC 2006, New Orleans, LA.

SLIDE 4

n Short term

Become macho GPU programmer: write “heroically tuned” codes.

n Medium term

Do it systematically: tuning for GTX 280 vs tuning for GTX 465: learn principles, not skills

n Long term

Do it automatically: Learn the foundations of automatic

compilation. Focus on a “regular subset” of programs

n Polyhedral Equational Model

SLIDE 5

n Big picture

n Polyhedral Equations as programs: I’m loath

to write C, despite the slogan “C no evil”

n Equations vs (conventional) loop programs

n Equations-to-code (compiling equations)

n Schedule n (processor) allocation n (memory) allocation

n But what about parallelism?

SLIDE 6

10 assignments (basic + advanced) + term project

- CUDA performance tuning (2)
- Equational programming: Alpha/AlphaZ (1)
- Mathematical foundations: polyhedra, affine functions, and operations (2)
- Alpha analysis/transformation (1)
- Analysis: scheduling & allocation (2)
- Code generation/tiling (2)

SLIDE 7

n Assignments

(30%)

n Midterm (take home)

(30%)

n Final project

(30% = 2+3+5+15+5)

n Proposal n Advancement report n Final report n Quality of work n Final poster

n Participation/Discussion/Quizzes

(10%)

SLIDE 8

n What are polyhedra? n Why are they useful/important n What is the polyhedral model?

SLIDE 9

n What is a model?

n A mathematical/computational/mechanical/

… abstraction of some other (physical) entity

n Objects in the model must “emulate” the

“natural operations” of the modeled entities – semantics

SLIDE 10

From Feautrier’s keynote at LCPC 2009

[Figure: history of the polytope model, in the stages Introduction, Prehistory, State of the Art, What Next? Topics shown: systolic array design, dependences, dependence tests, scheduling, placement, code generation, tiling, array shrinking, locality, HLS, automatic parallelization. The chart attributes each topic to contributors with dates; the pairings are not recoverable from this transcript.]

SLIDE 11

n Physical entity: programs/computations n The Polyhedral Model is a “very high level”

intermediate representation (IR) of “regular computations”

n Polyhedral equational model: real=abstract n Amenable to:

n Mathematical static analysis n Transformation within model: closure n Transformation outside model: (tiled) code

generation
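
To make the idea of a polyhedral representation concrete, here is a minimal Python sketch (not taken from the slides). It describes the iteration domain of a triangular loop nest as a set of affine inequalities and enumerates its integer points by brute force. The function names, the constraint encoding, and the bounding-box scan are all illustrative assumptions; real polyhedral tools work symbolically rather than by enumeration.

```python
from itertools import product

def integer_points(constraints, bounds):
    """Enumerate integer points x with sum(a_k * x_k) + c >= 0 for each (a, c).

    constraints: list of (coeffs, const) pairs, one affine inequality each.
    bounds: list of inclusive (lo, hi) ranges forming a bounding box to scan.
    Both the encoding and the brute-force scan are illustrative assumptions.
    """
    points = []
    for x in product(*(range(lo, hi + 1) for lo, hi in bounds)):
        if all(sum(a * v for a, v in zip(coeffs, x)) + const >= 0
               for coeffs, const in constraints):
            points.append(x)
    return points

def triangle_domain(n):
    """Domain {(i, j) | 0 <= j <= i < n}, written as four affine inequalities."""
    constraints = [
        ((1, 0), 0),        # i >= 0
        ((-1, 0), n - 1),   # i <= n - 1
        ((0, 1), 0),        # j >= 0
        ((1, -1), 0),       # j <= i
    ]
    return integer_points(constraints, [(0, n - 1), (0, n - 1)])

def loop_nest(n):
    """The same iteration space written as a conventional loop nest."""
    return [(i, j) for i in range(n) for j in range(i + 1)]
```

Both functions visit the same points in the same order, which is the sense in which the set of inequalities can stand in for the loop nest.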

SLIDE 12

n Class objectives, goals, introduction n CUDA performance tuning (wrap up) n Equational Programming (intro)?

SLIDE 13

n Many resources on the web (NVIDIA

webinars)

n Coalescing (HW1a)

n Challenge question: Achieve maximum

bandwidth, with fewest threads-per-block

n For a “strided-by-block” access pattern.

n Arithmetic peak: warps and “virtualization” n Bank conflicts in shared memory

SLIDE 14

n MAXPYrep:

n Repeatedly execute

Y=A*X+Y

n Where A, X and Y are matrices n Matrices are small enough to fit in shared

memory (ignore global memory access coalescing)

n Goal: achieve machine peak

n Port all previous performance to GTX 480

n And beyond … n Teach me

SLIDE 15

n Oxford CUDA conf (CUDA webinar online) n “Identifying Performance Limiters,”

Micikevicius NVIDIA/UCF (CUDA webinar)

n “Roofline for Fast Math” Sam Williams, LBL

n Wiki page for Pascal’s Triangle

http://en.wikipedia.org/wiki/Pascal's_triangle

n … and also a non-standard way to compute

Fibonacci numbers
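
The slides do not say which non-standard method is meant. One well-known candidate, shown here purely as an assumption, is that the sums along the shallow diagonals of Pascal's triangle are the Fibonacci numbers. The Python sketch below builds the triangle from its defining recurrence and then sums one diagonal; the function names are illustrative.

```python
def pascal_triangle(rows):
    """Build rows of Pascal's triangle from P(i, j) = P(i-1, j-1) + P(i-1, j)."""
    tri = []
    for i in range(rows):
        row = [1] * (i + 1)          # boundary entries P(i, 0) = P(i, i) = 1
        for j in range(1, i):
            row[j] = tri[i - 1][j - 1] + tri[i - 1][j]
        tri.append(row)
    return tri

def fibonacci_from_pascal(n):
    """Return F(n), with F(1) = F(2) = 1, as a shallow-diagonal sum."""
    tri = pascal_triangle(n)
    # F(n) = sum over k of C(n-1-k, k), for all k with k <= n-1-k.
    return sum(tri[n - 1 - k][k] for k in range((n - 1) // 2 + 1))
```

Both functions are direct transcriptions of equations over a triangular index domain, which is the style of program that the equational approach takes as its starting point.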