Motivation to Learn GPGPU (Julius Parulek)


SLIDE 1

Motivation to Learn GPGPU

Julius Parulek

SLIDE 2

Why to Learn About GPU?

Computational power of GPU vs. CPU

SLIDE 3

Why to Learn About GPU?

NVIDIA GPU relative performances

SLIDE 4

Why to Learn About GPU?

Hardware

SLIDE 5

Why to Learn About GPU?

Interactive rendering delivers almost off-line quality

SLIDE 6

Why to Learn About GPU?

GPU programming makes it possible to

parallelize the computation via a data-parallel streaming strategy

SIMT

Single Instruction, Multiple Threads

10,000s of simultaneously active threads

Examples

OpenCL, CUDA

Direct graphical enhancements

OpenGL / Direct X Shaders (GLSL,HLSL,Cg)

Replaceable state-machines on the GPU

SLIDE 7

Why to Learn About GPU?

Application fields

Force-field simulations

Particle systems, molecular simulations, graph drawing

Voronoi diagrams

Data analysis, motion planning, geometric optimization

Sorting and searching

Database and range queries

Matrix multiplication

Physical simulation

FFT

Signal processing

Visualization

Volume rendering, raycasting

SLIDE 8

Why to Learn About GPU?

Application fields

Force-field simulations

Particle systems, molecular simulations, graph drawing

Voronoi diagrams

Data analysis, motion planning, geometric optimization

Sorting and searching

Database and range queries

Matrix multiplication

Physical simulation

FFT

Signal processing

Visualization

Volume rendering, raycasting

Aim: real-time and interactive visualization of complex phenomena!

SLIDE 9

Introduction to CUDA

Julius Parulek

SLIDE 10

CUDA

Why to learn CUDA?

Random and unlimited access to memory

Read/write whenever necessary

Cooperation through shared memory

Low learning curve

Few extensions to C

No graphics overhead

Quick implementation

Focus on parallelization, not programming: a pen and paper suffice to analyze algorithms

Work is expressed as a number of blocks of a number of threads

SLIDE 11

CUDA

CUDA = a serial program with parallel kernels

Differences between CUDA and CPU threads:

CUDA threads are extremely lightweight

Very little creation overhead

Instant switching

CUDA uses 1000s of threads to achieve efficiency

Multi-core CPUs can use only a few

CUDA threads are physical

NVIDIA GPUs

SLIDE 12

CUDA

Kernel

A simple C program

Kernels are executed by threads

Thousands of threads execute the same kernel

Parallelization

Each thread has its own threadID

threadID

SLIDE 13

CUDA

Threads are organized into blocks

Threads in a block can synchronize (parallel task)

Blocks are grouped into a grid

Blocks are independent (might coordinate)

SLIDE 14

CUDA

Memory hierarchy

A thread (identified by its threadID) accesses registers

A block accesses shared memory

The grid accesses global memory

SLIDE 15

CUDA

Memory space overview

SLIDE 16

CUDA

Kernel/Thread execution

One kernel at a time is executed as a grid

A block executes on one multiprocessor

Several blocks can execute concurrently on one multiprocessor

SLIDE 17

CUDA

Variable keywords

__global__ void KernelFunc(...);

kernel function, runs on device

__device__ int GlobalVar;

variable in device memory

__shared__ int SharedVar;

variable in per-block shared memory

Execute the kernel

kernelFunc<<<500, 128>>>(a, b);

1D grid of 500 blocks, each containing a 1D array of 128 threads: 500 × 128 = 64,000 threads
SLIDE 18

CUDA

Get the thread ID via the CUDA built-in variables:

dim3 threadIdx, blockIdx, blockDim, gridDim;
int threadID = blockDim.x * blockIdx.x + threadIdx.x;

SLIDE 19

CUDA

Get the thread ID via the CUDA built-in variables:

dim3 threadIdx, blockIdx, blockDim, gridDim;
int threadID = blockDim.x * blockIdx.x + threadIdx.x;

// A is in shared memory
A[threadID] = begin[threadID];
// sync threads within a block
__syncthreads();
int left = A[threadID - 1];

SLIDE 20

CUDA

Get the thread ID via the CUDA built-in variables:

dim3 threadIdx, blockIdx, blockDim, gridDim;
int threadID = blockDim.x * blockIdx.x + threadIdx.x;

// A is in shared memory
A[threadID] = begin[threadID];
/* the atomic counter is increased when element threadID was accessed:
   atomicInc(&a, i) -> (*a >= i) ? 0 : (*a + 1) */
val = atomicInc(&B[threadID], val);
int left = A[threadID - 1];

SLIDE 21

CUDA

Atomic functions

An atomic operation is guaranteed to be performed without interference from other threads!

atomicAdd, atomicSub, atomicExch, atomicMin, atomicMax, atomicInc, atomicDec, atomicCAS, atomicAnd, atomicOr, atomicXor

Notes

Shared memory is faster than using atomic operations in global memory

Minimize their usage: they cause code serialization

SLIDE 22

CUDA

Hello World in CUDA

//add two arrays a+b in O(1)
__global__ void increment_gpu(float *a, float *b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b[idx];
}

int main()
{
    ...
    int blockSize = 500;
    dim3 dimBlock(blockSize);
    dim3 dimGrid(ceil(N / (float)blockSize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

SLIDE 23

CUDA

Hello World in CUDA

//add two arrays a+b in O(1)
__global__ void increment_gpu(float *a, float *b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b[idx];
}

int main()
{
    ...
    int blockSize = 500;
    dim3 dimBlock(blockSize);
    dim3 dimGrid(ceil(N / (float)blockSize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

SLIDE 24

CUDA

Histogram in CUDA using atomicInc

//compute a histogram b of the array a with a bin size of d
__global__ void histogram(float *a, int *b, float d, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        int bin = int(a[idx] / d);
        atomicInc(b + bin, N);
    }
}

int main()
{
    ...
    int blockSize = 500;
    dim3 dimBlock(blockSize);
    dim3 dimGrid(ceil(N / (float)blockSize));
    histogram<<<dimGrid, dimBlock>>>(a, b, d, N);
}

SLIDE 25

CUDA

Thread accessing privileges

We already know that each thread can:

Read/write per-thread registers

Read/write per-block shared memory

Read/write per-grid global memory

Additionally, each thread can also:

Read/write per-thread local memory

Read only per-grid constant memory

Read only per-grid texture memory

texture<float, 2> my_texture;   // declare texture reference

float4 texel = texfetch(my_texture, u, v);   // fetch the texel value

SLIDE 26

CUDA

Any source file containing CUDA language extensions must be compiled with NVCC

NVCC separates code running on the host from code running on the device

When running in device emulation mode, one can:

Use host native debug support (breakpoints, inspection, etc.)

Access any device-specific data from host code and vice versa

Call any host function from device code (e.g. printf) and vice versa

Detect deadlock situations caused by improper usage of __syncthreads

nvcc compiles a C++ source file into CPU and GPU parts; nvcc -deviceemu // use for debugging

SLIDE 27

CUDA

Reduction problems

Reduce N values to a single one:

Sum(v0, v1, ..., vN-2, vN-1)

Max(v0, v1, ..., vN-2, vN-1)

Min(v0, v1, ..., vN-2, vN-1)

SLIDE 28

CUDA

Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)

SLIDE 29

CUDA

Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)

__global__ void Sum(int *X, int *Y)   // Y holds one partial sum per block
{
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = X[i];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) Y[blockIdx.x] = sdata[0];
}

SLIDE 30

CUDA

Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)

SLIDE 31

CUDA

Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)

__global__ void Sum(int *X, int *Y)   // Y holds one partial sum per block
{
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = X[i];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) Y[blockIdx.x] = sdata[0];
}

SLIDE 32

CUDA

Example: Sum(V)=Sum(x0, x1, … , xk-2, xk-1)

__global__ void Sum(int *X, int *Y)   // Y holds one partial sum per block
{
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = X[i];
    __syncthreads();

    // do reduction in shared mem, unrolled for blockDim.x = 32
    if (tid < 16) sdata[tid] += sdata[tid + 16]; __syncthreads();
    if (tid < 8)  sdata[tid] += sdata[tid + 8];  __syncthreads();
    if (tid < 4)  sdata[tid] += sdata[tid + 4];  __syncthreads();
    if (tid < 2)  sdata[tid] += sdata[tid + 2];  __syncthreads();
    if (tid < 1)  sdata[tid] += sdata[tid + 1];  __syncthreads();

    // write result for this block to global mem
    if (tid == 0) Y[blockIdx.x] = sdata[0];
}

SLIDE 33

CUDA

Example: Sparse matrix vector multiplication

Kernel construction: y=Ax Strategy

One thread is assigned to each row in matrix A

SLIDE 34

CUDA

Example: Sparse matrix vector multiplication

Kernel construction: y=Ax Strategy

One thread is assigned to each row in matrix A

SLIDE 35

CUDA

Example: Sparse matrix vector multiplication

Kernel construction: y=Ax Strategy

One thread is assigned to each row in matrix A

SLIDE 36

CUDA

Example: Sparse matrix vector multiplication

Sparse matrix A is represented by data, rows, and cols

SLIDE 37

CUDA

Example: Sparse matrix vector multiplication

Sparse diagonal matrix A is represented by data and offsets

SLIDE 38

CUDA

Example: Sparse matrix vector multiplication

SLIDE 39

CUDA

Example: Sparse matrix vector multiplication

Sparse non-diagonal matrix A is represented by data and indices

indices is an n × k matrix

k is the maximal number of non-zero elements in a row

SLIDE 40

CUDA

Example: Sparse matrix vector multiplication

SLIDE 41

CUDA

Example: Sparse matrix vector multiplication

Sparse matrix A is represented by data, indices, and ptr

ptr[i] is the offset into data where the i-th row starts

SLIDE 42

CUDA

Example: Sparse matrix vector multiplication

SLIDE 43

CUDA

Example: Sparse matrix vector multiplication

__global__ void spmv_csr_vector_kernel(const int num_rows, const int *ptr,
                                       const int *indices, const float *data,
                                       const float *x, float *y)
{
    extern __shared__ float vals[];
    int tid = threadIdx.x;
    int row = blockIdx.x;          // one block per row
    if (row < num_rows) {
        int row_start = ptr[row];
        int row_end = ptr[row + 1];
        vals[tid] = 0;
        int jj = row_start + tid;
        if (jj < row_end)
            vals[tid] = data[jj] * x[indices[jj]];
        __syncthreads();
        // parallel reduction in shared memory
        if (tid < 16) vals[tid] += vals[tid + 16];
        if (tid < 8)  vals[tid] += vals[tid + 8];
        if (tid < 4)  vals[tid] += vals[tid + 4];
        if (tid < 2)  vals[tid] += vals[tid + 2];
        if (tid < 1)  vals[tid] += vals[tid + 1];
        // first thread writes the result
        if (tid == 0) y[row] += vals[0];
    }
}

SLIDE 44

CUDA

Ray-casting of implicit functions

// ray-cast an implicit function over an N.x-by-N.y image
__global__ void raycast(int *out, int2 N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    if (idx < N.x && idy < N.y) {
        Ray r(idx, idy);
        p = function_intersection(r, n);
        clr = compute_shading(p, n);
        out[idy * N.x + idx] = clr;
    }
}

SLIDE 45

CUDA

Libraries

PhysX,CUFFT,CUBLAS,CUDPP

Python/CUDA

pyCUDA

CUDA Python bindings + GPUarray class

cudamat

Matrix operations implemented in CUDA

gnumpy

Numpy arrays on GPU

pyfft

Fast Fourier transform library using pyCUDA

SLIDE 46

CUDA

Sources

http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html