Applications of Programming the GPU Directly from Python Using NumbaPro

Travis E. Oliphant, Ph.D.
Supercomputing 2013, November 20, 2013
Introduction
Enterprise Python | Scientific Computing | Data Processing | Data Analysis | Visualisation | Scalable Computing
- Products
- Training
- Support
- Consulting

Anaconda: free Python distribution; an Enterprise version which includes GPU support is available.
Wakari: Scientific Python in your browser; also available to install in your data center.
Big Picture
Empower domain experts, subject matter experts, and other occasional programmers with high-level tools that exploit modern hardware
Array Oriented Computing
Why Array-Oriented Computing?
- Express domain knowledge directly in arrays (tensors, matrices, vectors) --- easier to teach programming in the domain
- Can take advantage of parallelism and accelerators like the GPU
[Diagram: six objects, each with Attr1/Attr2/Attr3, reorganized into one array per attribute spanning Object1..Object6]
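A minimal NumPy sketch of the idea (not from the deck; the attribute arrays and the hypothetical score computation are illustrative assumptions):

    import numpy as np

    # one array per attribute; row i holds the value for object i
    attr1 = np.random.rand(6)
    attr2 = np.random.rand(6)
    attr3 = np.random.rand(6)

    # a whole-array expression applies to every object at once and is
    # straightforward to parallelize or offload to an accelerator
    score = attr1 * attr2 + attr3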
Why Python
License | Community | Readable Syntax | Modern Constructs | Batteries Included
Free and Open Source, Permissive License
- Broad and friendly community
- Over 36,000 packages on PyPI
- Commercial Support
- Many conferences (PyData, SciPy, PyCon...)
- Executable pseudo-code
- Can understand and edit code a year later
- Fun to develop
- Use of Indentation
IPython
- Interactive prompt on steroids (Notebook)
- Allows less working memory
- Allows failing quickly for exploration
- List comprehensions
- Iterator protocol and generators
- Meta-programming
- Introspection
- JIT Compiler and Concurrency (Numba)
- Internet (FTP, HTTP, SMTP, XML-RPC)
- Compression and Databases
- Great visualization tools (Bokeh, Matplotlib, etc.)
- Powerful libraries for STEM
- Integration with C/C++/Fortran
Breaking the Speed Barrier (Numba!)
Numba aims to be the world's best array-oriented compiler.

rapid iteration and development + fast code execution = the ideal combination!

Python syntax, but no GIL; native-code speed for numerical computing (NumPy code).
NumPy + Mamba = Numba
[Diagram: the LLVM library (via llvmpy) compiles Python functions to machine code across vendors and targets: Intel, NVIDIA, Apple, AMD, ARM; CLANG, OpenCL, ISPC, CUDA, OpenMP]
Example
    from math import sin, pi
    from numba import jit

    @jit('f8(f8)')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)
Compiler Overview

[Diagram: Numba, like C, C++, and Fortran front ends, lowers code to LLVM IR, which LLVM then compiles to x86, ARM, or PTX machine code]
Numba turns Python into a "compiled language": roughly a 150x speed-up here, enabling real-time image processing in Python (a 50 fps Mandelbrot demo).
Anaconda Accelerate
The Python and NumPy stack, compiled for parallel architectures (GPUs and multi-core machines)
- Compile NumPy array expressions for the CPU and GPU
- Create parallel-for loops
- Parallel execution of ufuncs
- Run ufuncs on the GPU
- Write CUDA directly in Python!
- Requires CUDA 5.5
Fast development and execution
$ conda install accelerate
NumbaPro Features
- CUDA Python
- Vectorize --- NumPy functions on the GPU
- Array expressions
- Parallel for loops
- Access to fast libraries (cuRAND, cuFFT, cuBLAS)
Compile NumPy array expressions
    import numbapro
    from numba import autojit

    @autojit
    def formula(a, b, c):
        a[1:, 1:] = a[1:, 1:] + b[1:, :-1] + c[1:, :-1]

    @autojit
    def express(m1, m2):
        m2[1:-1:2, 0, ..., ::2] = (m1[1:-1:2, ..., ::2] *
                                   m1[-2:1:-2, ..., ::2])
        return m2
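A possible invocation of formula (not from the deck; the array shapes and random contents are illustrative assumptions):

    import numpy as np

    a = np.random.rand(64, 64)
    b = np.random.rand(64, 64)
    c = np.random.rand(64, 64)
    formula(a, b, c)  # updates a[1:, 1:] in place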
Create parallel-for loops
The "prange" directive spawns compiled tasks in threads (like an OpenMP parallel-for pragma):

    import numbapro  # import first to make prange available
    from numba import autojit, prange

    @autojit
    def parallel_sum2d(a):
        sum = 0.0
        for i in prange(a.shape[0]):
            for j in range(a.shape[1]):
                sum += a[i, j]
        return sum
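A quick usage sketch (the input array is an assumption):

    import numpy as np

    a = np.random.rand(1000, 1000)
    total = parallel_sum2d(a)  # threads split the outer prange loop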
Fast vectorize
NumPy's ufuncs take "kernels" and apply the kernel element-by-element over entire arrays. Write kernels in Python!

    import numpy
    from math import sin, pi
    from numbapro import vectorize

    @vectorize(['f8(f8)', 'f4(f4)'])
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    x = numpy.linspace(-5, 5, 100)
    y = sinc(x)
Ufuncs in parallel (multi-thread or GPU)
    import numpy
    from math import sin, pi
    from numbapro import vectorize

    @vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    @vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
    def sinc2(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    x = numpy.linspace(-5, 5, 1000)
    y = sinc(x)   # On GPU
    z = sinc2(x)  # Multiple CPUs
Example Benchmark
The overhead of memory transfer is overcome after about 1 million floats for a simple computation; there is roughly 1 ms of overhead for memory transfer and set-up.
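A minimal timing sketch (not from the deck) to expose that crossover, reusing the sinc (GPU) and sinc2 (parallel CPU) ufuncs defined above; the sizes are arbitrary:

    import time
    import numpy as np

    for n in (10**4, 10**5, 10**6, 10**7):
        x = np.linspace(-5, 5, n)
        t0 = time.time()
        sinc2(x)                 # multi-core CPU
        t_cpu = time.time() - t0
        t0 = time.time()
        sinc(x)                  # GPU, includes host<->device transfer
        t_gpu = time.time() - t0
        print('n=%8d  parallel=%.4fs  gpu=%.4fs' % (n, t_cpu, t_gpu))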
Using Vectorize
    from numbapro import vectorize

    sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

    @vectorize([sig], target='gpu')
    def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
        pixel_size_x = (max_x - min_x) / width
        pixel_size_y = (max_y - min_y) / height
        x = tid % width
        y = tid // width  # integer division
        real = min_x + x * pixel_size_x
        imag = min_y + y * pixel_size_y
        c = complex(real, imag)
        z = 0.0j
        for i in range(iters):
            z = z * z + c
            if (z.real * z.real + z.imag * z.imag) >= 4:
                return i
        return 255
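A hypothetical driver for the kernel above (the image size, bounds, and iteration count are assumptions):

    import numpy as np

    width, height, iters = 1024, 768, 20
    tids = np.arange(width * height, dtype=np.uint32)  # one id per pixel
    pixels = mandel(tids, -2.0, 1.0, -1.0, 1.0, width, height, iters)
    image = pixels.reshape(height, width)  # iteration count per pixel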
    Kind      Time      Speed-up
    Python    263.6     1.0x
    CPU       2.639     100x
    GPU       0.1676    1573x

(GPU: Tesla S2050)
Grouping Calculations
    from numbapro import guvectorize

    prototype = "void(float32[:,:], float32[:,:], float32[:,:])"

    @guvectorize([prototype], '(m,n),(n,p)->(m,p)', target='gpu')
    def matmul(A, B, C):
        m, n = A.shape
        n, p = B.shape
        for i in range(m):
            for j in range(p):
                C[i, j] = 0
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]
Create “generalized ufuncs” whose elements are “arrays”
    import numpy as np

    matrix_ct = 1000  # batch size

    # creates an array of 1000 x 2 x 4
    A = np.arange(matrix_ct * 2 * 4,
                  dtype=np.float32).reshape(matrix_ct, 2, 4)
    # creates an array of 1000 x 4 x 5
    B = np.arange(matrix_ct * 4 * 5,
                  dtype=np.float32).reshape(matrix_ct, 4, 5)
    # outputs an array of 1000 x 2 x 5
    C = matmul(A, B)
Using cuBLAS
    import numpy as np
    from numbapro.cudalib import cublas

    N = 128  # example size

    A = np.array(np.arange(N ** 2, dtype=np.float32).reshape(N, N))
    B = np.array(np.arange(N) + 10, dtype=A.dtype)
    D = np.zeros_like(A, order='F')

    # NumPy reference
    E = np.dot(A, np.diag(B))

    # cuBLAS
    blas = cublas.Blas()
    blas.gemm('T', 'T', N, N, N, 1.0, A, np.diag(B), 1.0, D)
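As a sanity check (not in the original deck), the cuBLAS result should agree with the NumPy reference computed above:

    assert np.allclose(D, E)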
FFT Convolution with cuFFT
    import numpy as np
    from numbapro import cuda, vectorize
    from numbapro.cudalib import cufft

    @vectorize(['complex64(complex64, complex64)'], target='gpu')
    def vmult(a, b):
        return a * b

    # host -> device
    d_img = cuda.to_device(img)    # image
    d_fltr = cuda.to_device(fltr)  # filter

    # FFT forward
    cufft.fft_inplace(d_img)
    cufft.fft_inplace(d_fltr)

    # multiply in the frequency domain
    vmult(d_img, d_fltr, out=d_img)  # in place

    # FFT inverse
    cufft.ifft_inplace(d_img)

    # device -> host
    filtered_img = d_img.copy_to_host()

Works with 1-D, 2-D, and 3-D transforms.
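For reference, a possible set-up for the snippet above; the image and filter contents are assumptions (the in-place transforms need complex input, and the filter is assumed already zero-padded to the image shape):

    import numpy as np

    img = np.random.rand(256, 256).astype(np.complex64)   # "image"
    fltr = np.random.rand(256, 256).astype(np.complex64)  # padded filter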
Monte-Carlo Pricing and cuRAND
    import math
    import numpy as np
    from numbapro import vectorize, cuda
    from numbapro.cudalib import curand

    @vectorize(['f8(f8, f8, f8, f8, f8)'], target='gpu')
    def step(last, dt, c0, c1, noise):
        return last * math.exp(c0 * dt + c1 * noise)

    def monte_carlo_pricer(paths, dt, interest, volatility):
        n = paths.shape[0]
        blksz = cuda.get_current_device().MAX_THREADS_PER_BLOCK
        gridsz = int(math.ceil(float(n) / blksz))
        # Instantiate cuRAND PRNG
        prng = curand.PRNG(curand.PRNG.MRG32K3A)
        # Allocate device-side array
        d_normdist = cuda.device_array(n, dtype=np.double)
        c0 = interest - 0.5 * volatility ** 2
        c1 = volatility * math.sqrt(dt)
        # Simulation loop
        d_last = cuda.to_device(paths[:, 0])
        for j in range(1, paths.shape[1]):
            prng.normal(d_normdist, mean=0, sigma=1)
            d_paths = cuda.to_device(paths[:, j])
            step(d_last, dt, c0, c1, d_normdist, out=d_paths)
            d_paths.copy_to_host(paths[:, j])
            d_last = d_paths
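A hypothetical driver (the problem sizes and market parameters are assumptions, loosely following the linked blog post):

    import numpy as np

    n_paths, n_steps = 100000, 100
    dt = 1.0 / n_steps
    interest, volatility = 0.02, 0.3

    paths = np.zeros((n_paths, n_steps), order='F')
    paths[:, 0] = 100.0  # initial spot price
    monte_carlo_pricer(paths, dt, interest, volatility)
    print(paths[:, -1].mean())  # mean simulated terminal price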
Tuned kernel: the same step written directly as a CUDA kernel.

    import math
    from numbapro import jit, cuda

    @jit('void(double[:], double[:], double, double, double, double[:])',
         target='gpu')
    def step(last, paths, dt, c0, c1, normdist):
        # cuda.grid(1) expands to
        # cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        i = cuda.grid(1)
        if i >= paths.shape[0]:
            return
        noise = normdist[i]
        paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)
Benchmark: http://continuum.io/blog/monte-carlo-pricer
CUDA Python
    import numpy as np
    from numbapro import cuda
    from numba import autojit

    @autojit(target='gpu')
    def array_scale(src, dst, scale):
        tid = cuda.threadIdx.x
        blkid = cuda.blockIdx.x
        blkdim = cuda.blockDim.x
        i = tid + blkid * blkdim
        if i >= dst.shape[0]:
            return
        dst[i] = src[i] * scale

    N = 4096                         # example size
    block = 256                      # threads per block
    grid = (N + block - 1) // block  # blocks per grid
    src = np.arange(N, dtype=np.float64)
    dst = np.empty_like(src)
    array_scale[grid, block](src, dst, 5.0)
CUDA Development directly in Python
Example: Matrix multiply
    import numpy as np
    from numbapro import cuda
    from numba import f4

    bpg = 50  # blocks per grid
    tpb = 32  # threads per block
    n = bpg * tpb

    @cuda.jit(argtypes=[f4[:, :], f4[:, :], f4[:, :]])
    def cu_square_matrix_mul(A, B, C):
        sA = cuda.shared.array(shape=(tpb, tpb), dtype=f4)
        sB = cuda.shared.array(shape=(tpb, tpb), dtype=f4)

        tx = cuda.threadIdx.x
        ty = cuda.threadIdx.y
        bx = cuda.blockIdx.x
        by = cuda.blockIdx.y
        bw = cuda.blockDim.x
        bh = cuda.blockDim.y

        x = tx + bx * bw
        y = ty + by * bh

        acc = 0.
        for i in range(bpg):
            if x < n and y < n:
                sA[ty, tx] = A[y, tx + i * tpb]
                sB[ty, tx] = B[ty + i * tpb, x]
            cuda.syncthreads()
            if x < n and y < n:
                for j in range(tpb):
                    acc += sA[ty, j] * sB[j, tx]
            cuda.syncthreads()
        if x < n and y < n:
            C[y, x] = acc

    A = np.array(np.random.random((n, n)), dtype=np.float32)
    B = np.array(np.random.random((n, n)), dtype=np.float32)
    C = np.empty_like(A)

    stream = cuda.stream()
    with stream.auto_synchronize():
        dA = cuda.to_device(A, stream)
        dB = cuda.to_device(B, stream)
        dC = cuda.to_device(C, stream)
        cu_square_matrix_mul[(bpg, bpg), (tpb, tpb), stream](dA, dB, dC)
        dC.to_host(stream)
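An optional sanity check (not in the original deck); the loose tolerance allows for float32 accumulation error:

    assert np.allclose(C, np.dot(A, B), rtol=1e-3)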
Performance Results
About 6x faster on the GPU (GeForce GTX 560 Ti) than on the CPU (Core i7).
How to get it
- Anaconda Accelerate (Anaconda Add-On)
- Available as part of on-premise Wakari
- On wakari.io --- cloud-based Python (GPU instances coming soon)
- http://github.com/ContinuumIO/numbapro-examples
Anaconda user:

    conda install accelerate

Other Python users:

    pip install conda
    conda init
    conda install accelerate
enterprise.wakari.io