Sanjay Rajopadhye, Colorado State University

- Class objectives, goals, introduction
- CUDA performance tuning (wrap up)
- Equational Programming (intro)

- Parallel Programming is hard
- “End of the free lunch” [Sut05]
- Arrival of “manycores” signals the end of “La-Z-Boy Programming” [Pat06]
  - Becoming a parallel programming expert will get you a good job
  - But your skills may become obsolete: new machines, new languages, …
  - Parallelism must return to La-Z-Boy programming

[Sut05] Herb Sutter. “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.” Dr. Dobb's Journal, vol. 30, no. 3, 2005.
[Pat06] David Patterson. Keynote talk at the International Workshop on Languages and Compilers for Parallel Computing (LCPC 2006), New Orleans, LA.

- Short term
  - Become a macho GPU programmer: write “heroically tuned” codes.
- Medium term
  - Do it systematically: tuning for GTX 280 vs tuning for GTX 465: learn principles, not skills
- Long term
  - Do it automatically: learn the foundations of automatic compilation. Focus on a “regular subset” of programs
  - Polyhedral Equational Model

- Big picture
- Polyhedral Equations as programs: I’m loath to write C, despite the slogan “C no evil”
- Equations vs (conventional) loop programs
- Equations-to-code (compiling equations)
  - Schedule
  - (processor) allocation
  - (memory) allocation
- But what about parallelism?

10 assignments (basic + advanced) + term project
- CUDA performance tuning (2)
- Equational programming: Alpha/AlphaZ (1)
- Mathematical foundations: polyhedra, affine functions, and operations (2)
- Alpha analysis/transformation (1)
- Analysis: scheduling & allocation (2)
- Code generation/tiling (2)

- Assignments (30%)
- Midterm (take home) (30%)
- Final project (30% = 2+3+5+15+5)
  - Proposal
  - Advancement report
  - Final report
  - Quality of work
  - Final poster
- Participation/Discussion/Quizzes (10%)

- What are polyhedra?
- Why are they useful/important?
- What is the polyhedral model?

- What is a model?
  - A mathematical/computational/mechanical/… abstraction of some other (physical) entity
  - Objects in the model must “emulate” the “natural operations” of the modeled entities: semantics

From Feautrier’s keynote at LCPC 2009
[Figure: timeline relating The Polytope Model to systolic array design, dependences, dependence tests, automatic parallelization, scheduling, placement, code generation, tiling, array shrinking, locality, and HLS, with contributors and dates from 1966 to 2005.]

- Physical entity: programs/computations
- The Polyhedral Model is a “very high level” intermediate representation (IR) of “regular computations”
- Polyhedral equational model: real=abstract
- Amenable to:
  - Mathematical static analysis
  - Transformation within model: closure
  - Transformation outside model: (tiled) code generation

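To make "regular computations" concrete, here is a minimal sketch of a polyhedral iteration domain: the set of integer points that satisfy a system of affine inequalities. The triangular domain, the size parameter `N`, and the helper name `integer_points` are illustrative assumptions, not taken from the slides.

```python
from itertools import product

def integer_points(constraints, bounds):
    """Enumerate integer points in a bounding box that satisfy every constraint.

    Each constraint is (coeffs, const), meaning sum(c * x) + const >= 0.
    """
    ranges = [range(lo, hi + 1) for lo, hi in bounds]
    return [p for p in product(*ranges)
            if all(sum(c * x for c, x in zip(coeffs, p)) + const >= 0
                   for coeffs, const in constraints)]

N = 4  # hypothetical size parameter

# Triangle {(i, j) | 0 <= j <= i < N}, the iteration space of
#   for i in range(N): for j in range(i + 1): ...
triangle = [
    ((0, 1), 0),       # j >= 0
    ((1, -1), 0),      # i - j >= 0
    ((-1, 0), N - 1),  # N - 1 - i >= 0
]

points = integer_points(triangle, [(0, N - 1), (0, N - 1)])
loop_points = [(i, j) for i in range(N) for j in range(i + 1)]
```

The two descriptions denote the same set of points. The constraint form is what makes static analysis and transformation possible without running the loops.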
- Class objectives, goals, introduction
- CUDA performance tuning (wrap up)
- Equational Programming (intro)?

- Many resources on the web (NVIDIA webinars)
- Coalescing (HW1a)
  - Challenge question: achieve maximum bandwidth with the fewest threads per block, for a “strided-by-block” access pattern.
- Arithmetic peak: warps and “virtualization”
- Bank conflicts in shared memory

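As a rough illustration of why coalescing matters, the sketch below counts how many memory segments one warp touches for a given access stride. It is a simplified model, not a description of any particular GPU: the 4-byte element size and the 128-byte segment size are assumptions, and `segments_touched` is a hypothetical helper.

```python
WARP_SIZE = 32       # threads per warp on NVIDIA GPUs
WORD_BYTES = 4       # assumed element size (e.g. a float)
SEGMENT_BYTES = 128  # assumed memory transaction size

def segments_touched(stride, base=0):
    """Count distinct segments touched when thread t reads element base + t * stride."""
    addresses = [(base + t * stride) * WORD_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})

for stride in (1, 2, 32):
    print(stride, segments_touched(stride))
```

Under these assumptions, a unit stride keeps the whole warp inside one segment, while larger strides spread the same 32 reads over more segments and so waste bandwidth.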
- MAXPYrep:
  - Repeatedly execute Y=A*X+Y
  - Where A, X and Y are matrices
  - Matrices are small enough to fit in shared memory (ignore global memory access coalescing)
  - Goal: achieve machine peak
- Port all previous performance to GTX 480
  - And beyond …
- Teach me

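A plain reference version of the MAXPYrep computation can serve as a correctness check for a tuned GPU kernel. This sketch assumes `A*X` means the matrix product (the slides do not say so explicitly); the function name `maxpy_rep` and the `reps` parameter are hypothetical.

```python
def maxpy_rep(A, X, Y, reps):
    """Apply Y = A*X + Y `reps` times.

    A is n x k, X is k x m, Y is n x m, all given as lists of lists.
    A*X is assumed to be the matrix product.
    """
    n, k, m = len(A), len(X), len(X[0])
    Y = [row[:] for row in Y]  # work on a copy, leave the input untouched
    for _ in range(reps):
        for i in range(n):
            for j in range(m):
                acc = Y[i][j]
                for p in range(k):
                    acc += A[i][p] * X[p][j]
                Y[i][j] = acc
    return Y

A = [[1, 2], [3, 4]]
X = [[1, 0], [0, 1]]   # identity, so each repetition adds A to Y
Y0 = [[0, 0], [0, 0]]
result = maxpy_rep(A, X, Y0, 3)
```

Each repetition performs 2*n*k*m floating-point operations, which gives the operation count needed to compare a kernel's measured rate against machine peak.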
- Oxford CUDA conf (CUDA webinar online)
- “Identifying Performance Limiters,” Micikevicius, NVIDIA/UCF (CUDA webinar)
- “Roofline for Fast Math,” Sam Williams, LBL

- Wiki page for Pascal’s Triangle: http://en.wikipedia.org/wiki/Pascal's_triangle
- … and also a non-standard way to compute Fibonacci numbers

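One well-known link between the two, which may or may not be the one intended on the slide, is that sums along the shallow diagonals of Pascal's triangle give the Fibonacci numbers. The sketch below builds the triangle from its recurrence and then sums a diagonal; the function names are hypothetical.

```python
def pascal_rows(n):
    """Return rows 0..n-1 of Pascal's triangle, using P[i][j] = P[i-1][j-1] + P[i-1][j]."""
    rows = []
    for i in range(n):
        row = [1] * (i + 1)  # both ends of every row are 1
        for j in range(1, i):
            row[j] = rows[i - 1][j - 1] + rows[i - 1][j]
        rows.append(row)
    return rows

def fib_from_pascal(n):
    """Return F(n+1) as the sum over k of C(n-k, k), with F(1) = F(2) = 1."""
    rows = pascal_rows(n + 1)
    return sum(rows[n - k][k] for k in range(n // 2 + 1))

fibs = [fib_from_pascal(n) for n in range(10)]
```

Both functions are written as recurrences over index pairs, which is close in spirit to stating the computation as equations rather than as a loop program.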