GPU Metaprogramming using PyCUDA: Methods & Applications

SLIDE 1

Why GPU Scripting? Scripting CUDA GPU RTCG DG on GPUs Perspectives

GPU Metaprogramming using PyCUDA: Methods & Applications

Andreas Klöckner

Division of Applied Mathematics, Brown University

GPU @ BU · November 12, 2009

Andreas Klöckner · Applied Math · Brown University · GPU Metaprogramming using PyCUDA: Methods & Applications

SLIDE 2

Thanks

Tim Warburton (Rice) · Jan Hesthaven (Brown) · Nicolas Pinto (MIT) · PyCUDA contributors · Nvidia Corporation

SLIDE 3

Outline

1. Why GPU Scripting?
2. Scripting CUDA
3. GPU Run-Time Code Generation
4. DG on GPUs
5. Perspectives

SLIDE 4

Outline

1. Why GPU Scripting?
   Combining two Strong Tools
2. Scripting CUDA
3. GPU Run-Time Code Generation
4. DG on GPUs
5. Perspectives

SLIDE 5 · Combining two Strong Tools

How are High-Performance Codes constructed?

“Traditional” construction of high-performance codes: C/C++/Fortran, libraries.

“Alternative” construction: scripting for the ‘brains’, GPUs for the ‘inner loops’.

Play to the strengths of each programming environment.

SLIDE 6 · Combining two Strong Tools

Scripting: Means

A scripting language...
  is discoverable and interactive,
  has comprehensive built-in functionality,
  manages resources automatically,
  is dynamically typed, and
  works well for “gluing” lower-level blocks together.

SLIDE 7 · Combining two Strong Tools

Scripting: Interpreted, not Compiled

Program creation workflow: Edit → Compile → Link → Run

SLIDE 10 · Combining two Strong Tools

Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:
  highly parallel,
  very architecture-sensitive,
  built for maximum FP/memory throughput.

→ They complement each other. The CPU is largely restricted to control tasks (~1000/sec), and scripting is fast enough for that.

Python + CUDA = PyCUDA

SLIDE 13

Outline

1. Why GPU Scripting?
2. Scripting CUDA
   PyCUDA in Detail
   Do More, Faster with PyCUDA
3. GPU Run-Time Code Generation
4. DG on GPUs
5. Perspectives

SLIDE 14 · PyCUDA in Detail

Whetting your appetite

import pycuda.driver as cuda
import pycuda.autoinit
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]

SLIDE 15 · PyCUDA in Detail

Whetting your appetite

mod = cuda.SourceModule("""
__global__ void twice(float *a)
{
  int idx = threadIdx.x + threadIdx.y*4;
  a[idx] *= 2;
}
""")

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a
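Without a GPU at hand, the kernel's index arithmetic (idx = threadIdx.x + threadIdx.y*4 over a 4×4 thread block) can be emulated on the host with numpy. This sketch is not part of demo.py; it just checks that the flat index walks all 16 entries of the array:

```python
import numpy as np

a = np.random.randn(4, 4).astype(np.float32)
expected = 2 * a  # what the "twice" kernel should produce

# Emulate the 4x4 thread block: thread (x, y) touches idx = x + y*4,
# which visits each of a's 16 entries exactly once via a flat view.
flat = a.reshape(-1)
for y in range(4):
    for x in range(4):
        idx = x + y * 4
        flat[idx] *= 2

assert np.allclose(a, expected)
```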

SLIDE 17 · PyCUDA in Detail

Whetting your appetite, Part II

Did somebody say “Abstraction is good”?

SLIDE 18 · PyCUDA in Detail

Whetting your appetite, Part II

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a_gpu = gpuarray.to_gpu(
    numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print a_doubled
print a_gpu

SLIDE 19 · PyCUDA in Detail

PyCUDA Philosophy

Provide complete access
Automatically manage resources
Provide abstractions
Allow interactive use
Check for and report errors automatically
Integrate tightly with numpy

SLIDE 20 · PyCUDA in Detail

PyCUDA: Completeness

PyCUDA exposes all of CUDA. For example:
  Arrays and textures
  Page-locked host memory
  Memory transfers (asynchronous, structured)
  Streams and events
  Device queries
  GL interop

SLIDE 21 · PyCUDA in Detail

PyCUDA: Completeness

PyCUDA supports every OS that CUDA supports: Linux, Windows, OS X.

SLIDE 22 · PyCUDA in Detail

PyCUDA: Workflow

Edit → Run: SourceModule("...") checks the compiler cache. On a miss, nvcc compiles the source to a .cubin, which is then cached. Either way, the binary is uploaded to the GPU and run on the GPU.
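The caching step in this workflow can be sketched in a few lines: key the compiler output by a hash of the source (and options), and skip recompilation on a hit. This is an illustrative stand-in, not PyCUDA's actual implementation:

```python
import hashlib

class CompileCache:
    """Toy model of PyCUDA's SourceModule compile cache (illustrative)."""
    def __init__(self):
        self._cache = {}
        self.compile_count = 0

    def _compile(self, source):
        # Stand-in for invoking nvcc and reading back the .cubin bytes.
        self.compile_count += 1
        return b"CUBIN:" + source.encode()

    def get(self, source, options=()):
        key = hashlib.sha1(
            source.encode() + repr(sorted(options)).encode()).hexdigest()
        if key not in self._cache:          # Cache? miss -> nvcc -> .cubin
            self._cache[key] = self._compile(source)
        return self._cache[key]             # Cache! hit -> reuse binary

cache = CompileCache()
src = "__global__ void twice(float *a) { /* ... */ }"
first = cache.get(src)
second = cache.get(src)   # identical source: no recompilation
```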

SLIDE 31 · PyCUDA in Detail

gpuarray: Simple Linear Algebra

pycuda.gpuarray: meant to look and feel just like numpy.

gpuarray.to_gpu(numpy_array)
numpy_array = gpuarray.get()

+, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product, ...
Mixed types (int32 + float32 = float64)
print gpuarray for debugging
Allows access to the raw bits: use as kernel arguments, textures, etc.

SLIDE 32 · Do More, Faster with PyCUDA

gpuarray: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
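For reference, the linear combination this kernel fuses is equivalent to the following numpy computation on the CPU (useful for checking results). Note that numpy materializes the temporaries 5*x and 6*y in memory, which is exactly the extra store-fetch traffic ElementwiseKernel avoids:

```python
import numpy as np

def lin_comb_ref(a, x, b, y):
    # Reference for z[i] = a*x[i] + b*y[i].
    return a * x + b * y

x = np.random.randn(50).astype(np.float32)
y = np.random.randn(50).astype(np.float32)
z = lin_comb_ref(5, x, 6, y)
assert np.linalg.norm(z - (5 * x + 6 * y)) < 1e-5
```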

SLIDE 33 · Do More, Faster with PyCUDA

PyCUDA: Vital Information

http://mathema.tician.de/software/pycuda
Complete documentation
X Consortium License (no warranty, free for all use)
Requires: numpy, Boost C++, Python 2.4+
Support via mailing list

SLIDE 34

Outline

1. Why GPU Scripting?
2. Scripting CUDA
3. GPU Run-Time Code Generation
   Programs that write Programs
4. DG on GPUs
5. Perspectives

SLIDE 35 · Programs that write Programs

Metaprogramming

Idea: in GPU scripting, GPU code does not need to be a compile-time constant. (Key: code is data; it wants to be reasoned about at run time.)

Python Code → GPU Code → GPU Compiler → GPU Binary → GPU Result

The GPU code in this chain can be written by a human or generated by the machine; PyCUDA spans the whole chain, which makes it well suited to code generation.

SLIDE 43 · Programs that write Programs

Machine-generated Code

Why machine-generate code?
  Automated tuning (cf. ATLAS, FFTW)
  Data types
  Specialize code for a given problem
  Constants are faster than variables (→ register pressure)
  Loop unrolling

SLIDE 44 · Programs that write Programs

RTCG via Templates

from jinja2 import Template

tpl = Template("""
    __global__ void twice({{ type_name }} *tgt)
    {
      int idx = threadIdx.x
        + {{ thread_block_size }} * {{ block_size }} * blockIdx.x;

      {% for i in range(block_size) %}
        {% set offset = i*thread_block_size %}
        tgt[idx + {{ offset }}] *= 2;
      {% endfor %}
    }""")

rendered_tpl = tpl.render(
    type_name="float", block_size=block_size,
    thread_block_size=thread_block_size)

smod = SourceModule(rendered_tpl)
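The same run-time unrolling can be sketched with only the standard library's string.Template, so it runs without jinja2 or a GPU. The tuning parameters here are hypothetical stand-ins for values an autotuner would supply:

```python
from string import Template

# Hypothetical tuning parameters (in practice chosen by the autotuner).
thread_block_size = 32
block_size = 4
type_name = "float"

kernel_tpl = Template("""__global__ void twice($type_name *tgt)
{
  int idx = threadIdx.x + $total * blockIdx.x;
$body
}""")

# Unroll the per-thread loop at code-generation time: one line per work unit.
body = "\n".join(
    "  tgt[idx + %d] *= 2;" % (i * thread_block_size)
    for i in range(block_size))

source = kernel_tpl.substitute(
    type_name=type_name,
    total=thread_block_size * block_size,
    body=body)
print(source)
```

The generated string is what would be handed to SourceModule for compilation.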

SLIDE 45 · Programs that write Programs

RTCG via AST Generation

from codepy.cgen import *
from codepy.cgen.cuda import CudaGlobal

mod = Module([
    FunctionBody(
        CudaGlobal(FunctionDeclaration(
            Value("void", "twice"),
            arg_decls=[Pointer(POD(dtype, "tgt"))])),
        Block([
            Initializer(
                POD(numpy.int32, "idx"),
                "threadIdx.x + %d*blockIdx.x"
                % (thread_block_size*block_size)),
            ]+[
            Assign(
                "tgt[idx+%d]" % (o*thread_block_size),
                "2*tgt[idx+%d]" % (o*thread_block_size))
            for o in range(block_size)]))])

smod = SourceModule(mod)
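codepy builds the source from an AST rather than from text. The idea can be sketched with plain Python classes; this is a toy stand-in for codepy.cgen, not its real API:

```python
# Toy AST -> C source generator (illustrative stand-in for codepy.cgen).
class Assign:
    def __init__(self, lvalue, rvalue):
        self.lvalue, self.rvalue = lvalue, rvalue
    def emit(self):
        return "%s = %s;" % (self.lvalue, self.rvalue)

class Block:
    def __init__(self, stmts):
        self.stmts = stmts
    def emit(self):
        return "{\n" + "\n".join("  " + s.emit() for s in self.stmts) + "\n}"

class Function:
    def __init__(self, decl, body):
        self.decl, self.body = decl, body
    def emit(self):
        return self.decl + "\n" + self.body.emit()

thread_block_size, block_size = 32, 4
fn = Function(
    "__global__ void twice(float *tgt)",
    Block([Assign("int idx",
                  "threadIdx.x + %d*blockIdx.x"
                  % (thread_block_size * block_size))] +
          [Assign("tgt[idx+%d]" % (o * thread_block_size),
                  "2*tgt[idx+%d]" % (o * thread_block_size))
           for o in range(block_size)]))
print(fn.emit())
```

The advantage over templating is that the program is manipulated as structured data until the final emit step.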

SLIDE 46

Outline

1. Why GPU Scripting?
2. Scripting CUDA
3. GPU Run-Time Code Generation
4. DG on GPUs
   Introduction
   DG and Metaprogramming
   Results
5. Perspectives

SLIDE 47 · Introduction

Discontinuous Galerkin Method

Let $\Omega := \bigcup_i D_k \subset \mathbb{R}^d$.

Goal: solve a conservation law on $\Omega$: $u_t + \nabla \cdot F(u) = 0$.

Example: Maxwell's equations. The EM field $E(x,t)$, $H(x,t)$ on $\Omega$ is governed by
$$\partial_t E - \frac{1}{\varepsilon}\nabla\times H = -\frac{j}{\varepsilon}, \qquad \partial_t H + \frac{1}{\mu}\nabla\times E = 0, \qquad \nabla\cdot E = \frac{\rho}{\varepsilon}, \qquad \nabla\cdot H = 0.$$

SLIDE 50 · Introduction

Discontinuous Galerkin Method

Multiply by a test function and integrate by parts:
$$0 = \int_{D_k} u_t \varphi + [\nabla\cdot F(u)]\,\varphi \, dx = \int_{D_k} u_t \varphi - F(u)\cdot\nabla\varphi \, dx + \int_{\partial D_k} (\hat{n}\cdot F)^* \varphi \, dS_x.$$

Integrate by parts again, substitute in basis functions, and introduce elementwise differentiation and “lifting” matrices $D$, $L$:
$$\partial_t u^k = -\sum_\nu D^{\partial,\nu,k}[F(u^k)] + L^k[\hat{n}\cdot F - (\hat{n}\cdot F)^*]\big|_{A\subset\partial D_k}.$$

For straight-sided simplicial elements: reduce $D^{\partial,\nu}$ and $L$ to reference matrices.

SLIDE 51 · DG and Metaprogramming

Metaprogramming for GPU-DG

Specialize code for the user-given problem:
  Flux terms (*)

Automated tuning:
  Memory layout
  Loop slicing (*)
  Gather granularity

Constants instead of variables:
  Dimensionality
  Polynomial degree
  Element properties
  Matrix sizes

Loop unrolling

SLIDE 56 · DG and Metaprogramming

Metaprogramming DG: Flux Terms
$$0 = \int_{D_k} u_t\varphi + [\nabla\cdot F(u)]\,\varphi\,dx - \int_{\partial D_k} [\hat{n}\cdot F - (\hat{n}\cdot F)^*]\,\varphi\,dS_x$$
The boundary integral is the flux term. Flux terms vary by problem; their expression is specified by the user and evaluated pointwise.

SLIDE 58 · DG and Metaprogramming

Metaprogramming DG: Flux Terms Example

Example: fluxes for Maxwell's equations,
$$[\hat{n}\cdot(F - F^*)]_E := \tfrac{1}{2}\,[\hat{n}\times(H - \alpha\,\hat{n}\times E)]$$

The user writes a vectorial statement in mathematical notation:

flux = 1/2*cross(normal, h.int-h.ext
    -alpha*cross(normal, e.int-e.ext))

SLIDE 60 · DG and Metaprogramming

Metaprogramming DG: Flux Terms Example

We generate a scalar evaluator in C (6×):

a_flux += (
    (((val_a_field5 - val_b_field5)*fpair->normal[2]
      - (val_a_field4 - val_b_field4)*fpair->normal[0])
     + val_a_field0 - val_b_field0)*fpair->normal[0]
    - (((val_a_field4 - val_b_field4)*fpair->normal[1]
       - (val_a_field1 - val_b_field1)*fpair->normal[2])
     + val_a_field3 - val_b_field3)*fpair->normal[1]
  )*value_type(0.5);
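Numerically, the vectorial flux statement can be checked against a direct numpy transcription. The helper name and the test data below are ours, not the talk's; only the formula is from the slide:

```python
import numpy as np

def maxwell_flux_E(normal, e_int, e_ext, h_int, h_ext, alpha):
    # E-field flux from the slide:
    # [n . (F - F*)]_E := 1/2 [ n x (dH - alpha n x dE) ], d = int - ext.
    dh = h_int - h_ext
    de = e_int - e_ext
    return 0.5 * np.cross(normal, dh - alpha * np.cross(normal, de))

rng = np.random.default_rng(0)
n = rng.standard_normal(3)
n /= np.linalg.norm(n)                 # unit face normal
fields = [rng.standard_normal(3) for _ in range(4)]  # e_int, e_ext, h_int, h_ext
flux = maxwell_flux_E(n, *fields, alpha=0.5)
assert flux.shape == (3,)
```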

SLIDE 61 · DG and Metaprogramming

Loop Slicing on the GPU: A Pattern

Setting: N independent work units + preparation.

Question: How should one assign work units to threads?

  ws: in sequence (within a thread)
  wi: “inline-parallel” (within a thread)
  wp: in parallel (across threads)

(amortize preparation) (exploit register space)
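One plausible way to read the three granularities is as a three-way decomposition of the work-unit index. This sketch is our illustration of the pattern, not code from the talk; the index arithmetic and parameter values are hypothetical:

```python
# Decompose a work-unit index into the three slicing directions:
# wp (parallel group), ws (sequential loop), wi (inline/unrolled slot).
def slice_work(unit, ws, wi):
    group, rest = divmod(unit, ws * wi)   # group -> run in parallel (wp)
    s, i = divmod(rest, wi)               # s -> sequential loop, i -> unrolled slot
    return group, s, i

# 12 work units with ws = 3, wi = 2 need ceil(12 / 6) = 2 parallel groups.
plan = [slice_work(u, ws=3, wi=2) for u in range(12)]
assert plan[0] == (0, 0, 0)
assert plan[11] == (1, 2, 1)
```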

SLIDE 67 · DG and Metaprogramming

Loop Slicing for Differentiation

[Plot: execution time (ms) of local differentiation (matrix-in-shared, order 4, with microblocking) as a function of ws and wp; point size denotes wi ∈ {1, ..., 4}.]

SLIDE 68 · Results

Nvidia GTX280 vs. a single core of an Intel Core 2 Duo E8400

[Plot: GFlops/s (GPU vs. CPU) and speedup factor as a function of polynomial order N.]

SLIDE 69 · Results

16 T10s vs. 64 = 8 × 2 × 4 Xeon E5472 cores

[Plot: flop rates and speedups, 16 GPUs vs. 64 CPU cores; GFlops/s and speedup factor as a function of polynomial order N.]

SLIDE 70 · Results

GPU DG Showcase

Electromagnetism · Poisson · CFD · etc.

SLIDE 74

Outline

1. Why GPU Scripting?
2. Scripting CUDA
3. GPU Run-Time Code Generation
4. DG on GPUs
5. Perspectives
   Conclusions

SLIDE 75 · Related Developments

Introducing... PyOpenCL

PyOpenCL is “PyCUDA for OpenCL”:
  Complete, mature API wrapper
  Features like PyCUDA's: not yet
  Tested on all available implementations and OSs
  http://mathema.tician.de/software/pyopencl

SLIDE 76 · Related Developments

Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone. Obvious idea: Let the computer do it. One way: Smart compilers

Andreas Kl¨

  • ckner

Applied Math · Brown University GPU Metaprogramming using PyCUDA: Methods & Applications

slide-77
SLIDE 77

Why GPU Scripting? Scripting CUDA GPU RTCG DG on GPUs Perspectives Related Developments

Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone. Obvious idea: Let the computer do it. One way: Smart compilers

GPU programming requires complex tradeoffs Tradeoffs require heuristics Heuristics are fragile

Andreas Kl¨

  • ckner

Applied Math · Brown University GPU Metaprogramming using PyCUDA: Methods & Applications

slide-78
SLIDE 78

Why GPU Scripting? Scripting CUDA GPU RTCG DG on GPUs Perspectives Related Developments


Another way: Dumb enumeration

  • Enumerate loop slicings
  • Enumerate prefetch options
  • Choose by running resulting code on actual hardware
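The enumerate-and-time idea can be sketched in plain Python. This is a CPU stand-in with hypothetical names (`make_variant`, `autotune`); on a GPU, each candidate would be a generated kernel compiled with PyCUDA and timed on the actual hardware.

```python
import timeit

# Hypothetical stand-in for kernel variants: each candidate is one
# "loop slicing" (here: a chunk size) of the same computation.
def make_variant(block):
    def run(data):
        total = 0
        # process the data in chunks of size `block`
        for start in range(0, len(data), block):
            total += sum(data[start:start + block])
        return total
    return run

def autotune(candidates, data, repeats=3):
    """Dumb enumeration: run every candidate, keep the fastest."""
    best, best_time = None, float("inf")
    for block in candidates:
        run = make_variant(block)
        t = min(timeit.repeat(lambda: run(data), number=5, repeat=repeats))
        if t < best_time:
            best, best_time = block, t
    return best

data = list(range(10000))
best_block = autotune([16, 64, 256, 1024], data)
# every variant computes the same answer; only speed differs
assert make_variant(best_block)(data) == sum(data)
```

No heuristic is needed: the hardware itself decides which variant wins.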

SLIDE 79


Loo.py Example

Empirical GPU loop optimization:

    a, b, c, i, j, k = [var(s) for s in "abcijk"]
    n = 500
    k = make_loop_kernel([
        LoopDimension("i", n),
        LoopDimension("j", n),
        LoopDimension("k", n),
        ], [
        (c[i + n*j], a[i + n*k] * b[k + n*j])
        ])
    gen_kwargs = {
        "min_threads": 128,
        "min_blocks": 32,
        }

→ Ideal case: Finds 160 GF/s kernel without human intervention.

SLIDE 80


Loo.py Status

Limited scope:

  • Requires input/output separation
  • Kernels must be expressible using the “loopy” model (i.e. indices decompose into “output” and “reduction”)
  • Enough for DG, LA, FD, ...
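The output/reduction index split is easiest to see on a matrix product, sketched here in plain Python (`loopy_matmul` is an illustrative name, not Loo.py API): in c[i][j] = Σₖ a[i][k]·b[k][j], i and j index the output while k is summed away.

```python
# Plain-Python illustration of the "loopy" model's index classification.
# In c[i][j] = sum_k a[i][k] * b[k][j]:
#   i, j -> "output" indices (they index the result)
#   k    -> "reduction" index (it is summed away)
def loopy_matmul(a, b):
    n_i, n_k, n_j = len(a), len(b), len(b[0])
    c = [[0.0] * n_j for _ in range(n_i)]
    for i in range(n_i):            # output index
        for j in range(n_j):        # output index
            acc = 0.0               # reduction accumulator
            for k in range(n_k):    # reduction index
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c

# 2x2 check against a hand-computed product
assert loopy_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```

Any kernel whose loops split this way fits the model; arbitrary data-dependent control flow does not.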

SLIDE 81



  • Kernel compilation limits trial rate
  • Non-goal: peak performance
  • Good results currently for dense linear algebra and (some) DG subkernels

SLIDE 82


Conclusions

Fun time to be in computational science

SLIDE 83


Use Python and PyCUDA to have even more fun :-)

With no compromise in performance

SLIDE 84



GPUs and scripting work well together

Enable Metaprogramming
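The metaprogramming point can be made concrete without a GPU: run-time code generation simply builds kernel source as a string with tuning parameters baked in, then hands it to the compiler. A minimal sketch, assuming an illustrative axpy kernel and hypothetical names (`KERNEL_TEMPLATE`, `generate_axpy`):

```python
# Sketch of GPU run-time code generation (RTCG): bake parameters such as
# block size and unroll factor into the kernel source at run time, the
# way PyCUDA users feed generated source to SourceModule. Generation
# itself needs no GPU, so this runs anywhere.
KERNEL_TEMPLATE = """
__global__ void axpy(float a, float *x, float *y)
{
  int i = blockIdx.x * %(block_size)d + threadIdx.x;
  // loop unrolled %(unroll)d times at code-generation time
%(body)s
}
"""

def generate_axpy(block_size, unroll):
    body = "\n".join(
        "  y[i + %d] = a * x[i + %d] + y[i + %d];" % (u, u, u)
        for u in range(unroll))
    return KERNEL_TEMPLATE % {
        "block_size": block_size, "unroll": unroll, "body": body}

src = generate_axpy(block_size=128, unroll=4)
# with PyCUDA one would now compile src and fetch the "axpy" function
assert "y[i + 3]" in src
```

Because the generator is ordinary Python, loop bounds, data types, and unrolling can all be specialized per problem instance.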

SLIDE 85



Further work in GPU-DG:

  • Other equations (Euler, Navier-Stokes)
  • Curvilinear Elements
  • Local Time Stepping

SLIDE 86


Where to from here?

More at → http://mathema.tician.de/

CUDA-DG: AK, T. Warburton, J. Bridge, J.S. Hesthaven, “Nodal Discontinuous Galerkin Methods on Graphics Processors”, J. Comp. Phys., 2009.

GPU RTCG: AK, N. Pinto et al., “PyCUDA: GPU Run-Time Code Generation for High-Performance Computing”, in prep.

SLIDE 87


Questions?


Thank you for your attention! http://mathema.tician.de/

SLIDE 88


Image Credits

  • Circuitry: flickr.com/oskay
  • C870 GPU: Nvidia Corp.
  • Old Books: flickr.com/ppdigital
  • OpenCL logo: Ars Technica/Apple Corp.
  • OS Platforms: flickr.com/aOliN.Tk
  • Adding Machine: flickr.com/thomashawk
  • Floppy disk: flickr.com/ethanhein
  • Machine: flickr.com/13521837@N00
