Sanjay Rajopadhye Colorado State University n Class objectives, - - PowerPoint PPT Presentation



SLIDE 1

Sanjay Rajopadhye Colorado State University

SLIDE 2

n Class objectives, goals, introduction n CUDA performance tuning (wrap up) n Equational Programming (intro)

SLIDE 3

n Parallel Programming is hard

n “End of the free lunch”

[Sut05]

n Arrival of “manycores” signals the end of

“La-Z-Boy Programming” [Pat06]

Becoming a parallel programming expert will get you a good job But your skills may become obsolete – new machines, new languages, … Parallelism must return to La-Z-Boy programming

[Sut05] Herb Sutter. “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software,” Dr. Dobb's Journal, vol. 30, no. 3, 2005.

[Pat06] David Patterson, keynote talk at the International Workshop on Languages and Compilers for Parallel Computing (LCPC 2006), New Orleans, LA.

SLIDE 4

n Short term

Become macho GPU programmer: write “heroically tuned” codes.

n Medium term

Do it systematically: tuning for GTX 280 vs tuning for GTX 465: learn principles, not skills

n Long term

Do it automatically: Learn the foundations of automatic

  • compilation. Focus on a “regular subset” of programs

n Polyhedral Equational Model

SLIDE 5

n Big picture

n Polyhedral Equations as programs: I’m loath

to write C, despite the slogan “C no evil”

n Equations vs (conventional) loop programs

n Equations-to-code (compiling equations)

n Schedule n (processor) allocation n (memory) allocation

n But what about parallelism?

SLIDE 6

10 assignments (basic + advanced) + term project

n CUDA performance tuning

(2)

n Equational programming: Alpha/AlphaZ (1) n Mathematical foundations: polyhedra,

affine functions, and operations (2)

n Alpha analysis/transformation

(1)

n Analysis: scheduling & allocation

(2)

n Code generation/tiling

(2)

SLIDE 7

n Assignments

(30%)

n Midterm (take home)

(30%)

n Final project

(30% = 2+3+5+15+5)

n Proposal n Advancement report n Final report n Quality of work n Final poster

n Participation/Discussion/Quizzes

(10%)

SLIDE 8

n What are polyhedra? n Why are they useful/important n What is the polyhedral model?

SLIDE 9

n What is a model?

n A mathematical/computational/mechanical/

… abstraction of some other (physical) entity

n Objects in the model must “emulate” the

“natural operations” of the modeled entities – semantics

SLIDE 10

From Feautrier’s keynote at LCPC 2009

[Figure: diagram from Feautrier’s slide (sections: Introduction, Prehistory, State of the Art, What Next?) relating topics of the polytope model (systolic array design, dependences and dependence tests, scheduling, placement, code generation, tiling, array shrinking, locality, HLS, automatic parallelization) to contributors and dates from 1966 to 2005. Names shown: Karp, Miller, Winograd; Bernstein; L. Lamport; Banerjee; Cousot, Halbwachs; H. T. Kung; Kuck; Allen, Kennedy; Rajopadhye; Irigoin; JL Xue; Quinton, Robert; Pugh; LC Lu; Wolfe; Lam; Lengauer; Rau; Pingali; Risset; Fortes; Bastoul; Boulet; Darte; PF.]

SLIDE 11

n Physical entity: programs/computations n The Polyhedral Model is a “very high level”

intermediate representation (IR) of “regular computations”

n Polyhedral equational model: real=abstract n Amenable to:

n Mathematical static analysis n Transformation within model: closure n Transformation outside model: (tiled) code

generation
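
To make “regular computations” and “closure” a little more concrete, the sketch below treats an iteration domain as the integer points satisfying a set of affine inequalities, then applies an invertible affine map to it. The triangular domain, the helper names (`in_domain`, `triangle`, `skew`, `skewed_triangle`, `points`) and the use of Python are illustrative assumptions; none of this comes from the slides or from Alpha/AlphaZ.

```python
from itertools import product


def in_domain(point, constraints):
    """True if every affine constraint a.z + b >= 0 holds at `point`."""
    return all(
        sum(a * z for a, z in zip(coeffs, point)) + const >= 0
        for coeffs, const in constraints
    )


def triangle(n):
    """Constraints for {(i, j) | 0 <= j <= i < n} (hypothetical example)."""
    return [
        ((0, 1), 0),        # j >= 0
        ((1, -1), 0),       # i - j >= 0
        ((-1, 0), n - 1),   # n - 1 - i >= 0
    ]


def skew(point):
    """Invertible affine map (i, j) -> (i + j, j)."""
    i, j = point
    return (i + j, j)


def skewed_triangle(n):
    """The triangle's constraints rewritten in the skewed coordinates."""
    return [
        ((0, 1), 0),        # j' >= 0
        ((1, -2), 0),       # i' - 2 j' >= 0
        ((-1, 1), n - 1),   # n - 1 - i' + j' >= 0
    ]


def points(constraints, bounds):
    """Integer points of a domain, found by scanning a bounding box."""
    box = product(*(range(b) for b in bounds))
    return [p for p in box if in_domain(p, constraints)]
```

For this particular map, `sorted(skew(p) for p in points(triangle(n), (n, n)))` gives the same points as `points(skewed_triangle(n), (2 * n - 1, n))`: the image of the domain is again described by affine inequalities, which is a small instance of a transformation staying within the model.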

SLIDE 12

n Class objectives, goals, introduction n CUDA performance tuning (wrap up) n Equational Programming (intro)?

SLIDE 13

n Many resources on the web (NVIDIA

webinars)

n Coalescing (HW1a)

n Challenge question: Achieve maximum

bandwidth, with fewest threads-per-block

n For a “strided-by-block” access pattern.

n Arithmetic peak: warps and “virtualization” n Bank conflicts in shared memory

SLIDE 14

n MAXPYrep:

n Repeatedly execute

Y=A*X+Y

n Where A, X and Y are matrices n Matrices are small enough to fit in shared

memory (ignore global memory access coalescing)

n Goal: achieve machine peak

n Port all previous performance to GTX 480

n And beyond … n Teach me

SLIDE 15

n Oxford CUDA conf (CUDA webinar online) n “Identifying Performance Limiters,”

Micikevicius NVIDIA/UCF (CUDA webinar)

n “Roofline for Fast Math” Sam Williams, LBL

SLIDE 16

n Wiki page for Pascal’s Triangle

http://en.wikipedia.org/wiki/Pascal's_triangle

n … and also a non-standard way to compute

Fibonacci numbers
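
Pascal's triangle is naturally written as an equation over a triangular domain, and one well-known link to Fibonacci numbers is that the sums along its shallow diagonals are Fibonacci numbers. The sketch below writes both down directly in Python. Whether this diagonal-sum identity is the “non-standard way” the slide has in mind is an assumption, and the names `pascal` and `fib` are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def pascal(i, j):
    """Entry j of row i of Pascal's triangle, defined for 0 <= j <= i."""
    if j == 0 or j == i:
        return 1
    return pascal(i - 1, j - 1) + pascal(i - 1, j)


def fib(n):
    """Fibonacci number F(n), with F(1) = F(2) = 1, as a diagonal sum."""
    return sum(pascal(n - 1 - k, k) for k in range((n - 1) // 2 + 1))
```

The code states what each value is rather than the order in which to compute it; the memoization decorator stands in for the schedule and memory allocation that a compiler for equations would have to derive.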
