SLIDE 1

/INFOMOV/ Optimization & Vectorization

  • J. Bikker - Sep-Nov 2015 - Lecture 14: “GPGPU (2)”

Welcome!

SLIDE 2

Today’s Agenda:

  • Practical GPGPU: Verlet Fluid
  • GPGPU Algorithms
  • Optimizing GPU code
SLIDE 3

Verlet

INFOMOV – Lecture 14 – “GPGPU (2)” 3

https://www.youtube.com/watch?v=JcgkAMr9r5o

SLIDE 4

Verlet

Verlet Physics

Motion: x₁ = x₀ + v₀∆t. We can express this without velocities: x₂ = x₁ + (x₁ − x₀).

Simulation:

  • Backup current position: x_current = x
  • Update positions: x = x + (x − x_previous)
  • Store last position: x_prev = x_current
  • Apply constraints (e.g. walls)

Applying constraints:

  • e.g. if (x < 0) x = 0;
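Taken together, the steps above amount to a few lines per particle. A minimal CPU sketch (the Particle struct and function name are illustrative, not from the slides):

```cpp
#include <cassert>

struct Particle { float x, y, px, py; }; // current and previous position

// One Verlet step for a single particle: the velocity is implicit in the
// difference between the current and the previous position.
void verletStep( Particle& p, float gravity )
{
    float cx = p.x, cy = p.y;       // backup current position
    p.x += p.x - p.px;              // x = x + (x - x_previous)
    p.y += p.y - p.py + gravity;
    p.px = cx; p.py = cy;           // store last position
    if (p.x < 0) p.x = 0;           // apply constraint (e.g. a wall at x = 0)
}
```

A particle that moved from x = 0.5 to x = 1.0 in the last step keeps moving at the same implicit speed, ending up at x = 1.5.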
SLIDE 5

Verlet

Verlet Physics

Cloth:

  • Using a grid of vertices
  • Forces on all vertices: gravity
  • Constraint for top row: fixed position
  • Constraint for all vertices: maximum distance to neighbors

Fluid:

  • Using large collection of particles
  • Forces on all particles: gravity
  • Constraint for all particles: container boundaries
  • Constraint for all particles: do not intersect other particles


SLIDE 6

Verlet

GPU Verlet Fluid

Input:

  • Array of particle positions
  • Array of previous particle positions

Output:

  • Visualization of simulation
  • Array of particle positions (updated)
  • Array of previous particle positions (updated)


SLIDE 7

Verlet

GPU Verlet Fluid

Drawing a number of moving particles using OpenCL

STAGE 1

SLIDE 8

Verlet

GPU Verlet Fluid – Host Code

Buffer* balls = new Buffer( BALLCOUNT * 6 * sizeof( float ) );

// put initial ball positions in buffer
float* fb = (float*)balls->GetHostPtr();
for( int i = 0; i < BALLCOUNT; i++ )
{
    fb[i * 6 + 0] = Rand( 1 );              // position x
    fb[i * 6 + 1] = Rand( 1 );              // position y
    fb[i * 6 + 2] = Rand( 0.01f ) - 0.005f; // velocity x (for now)
    fb[i * 6 + 3] = Rand( 0.01f ) - 0.005f; // velocity y (for now)
    fb[i * 6 + 4] = fb[i * 6 + 0];          // copy of position x
    fb[i * 6 + 5] = fb[i * 6 + 1];          // copy of position y
}
balls->CopyToDevice();

SLIDE 9

Verlet

GPU Verlet Fluid – Device Code

__kernel void clear( write_only image2d_t outimg )
{
    int column = get_global_id( 0 );
    int line = get_global_id( 1 );
    if ((column >= 800) || (line >= 480)) return;
    write_imagef( outimg, (int2)(column, line), 0 );
}

Task: write a single black pixel. Workset: number of pixels.

__kernel void update( global float* balls )
{
    int idx = get_global_id( 0 );
    balls[idx * 6 + 0] += balls[idx * 6 + 2]; // x += vx
    balls[idx * 6 + 1] += balls[idx * 6 + 3]; // y += vy
}

Task: update the position of one ball. Workset: number of balls.
SLIDE 10

Verlet

GPU Verlet Fluid – Device Code

__kernel void render( write_only image2d_t outimg, global float* balls )
{
    int column = get_global_id( 0 );
    int line = get_global_id( 1 );
    float2 uv = { (float)column / 800.0f, (float)line / 480.0f };
    for( int i = 0; i < BALLCOUNT; i++ )
    {
        float2 pos = { balls[i * 6], balls[i * 6 + 1] };
        float dist = length( pos - uv );
        if (dist > 0.02f) continue;
        write_imagef( outimg, (int2)(column, 479 - line), (float4)(1, 0, 0, 1) );
        break;
    }
}

SLIDE 11

Verlet

GPU Verlet Fluid – Result


SLIDE 12

Verlet

GPU Verlet Fluid

Rendering many particles efficiently

STAGE 2

SLIDE 13

Verlet

GPU Verlet Fluid – Grid

Host:

grid = new Buffer( GRIDX * GRIDY * (BALLSPERCELL + 1) * sizeof( unsigned int ) );

Device:

__kernel void clearGrid( global unsigned int* grid )
{
    int idx = get_global_id( 0 );
    int baseIdx = idx * (BALLSPERCELL + 1);
    grid[baseIdx] = 0; // ball count for this cell
}

Task: reset a grid cell by setting its ball count to 0. Workset: number of cells.

Data layout:

  • [0]: ball count for cell
  • [1..N]: ball indices
SLIDE 14

Verlet

GPU Verlet Fluid – Grid

__kernel void fillGrid( global float* balls, global unsigned int* grid )
{
    int ballIdx = get_global_id( 0 );
    int gx = balls[ballIdx * 6 + 0] * GRIDX;
    int gy = balls[ballIdx * 6 + 1] * GRIDY;
    if ((gx < 0) || (gy < 0) || (gx >= GRIDX) || (gy >= GRIDY)) return;
    int baseIdx = (gx + gy * GRIDX) * (BALLSPERCELL + 1);
    int count = grid[baseIdx]++; // not thread-safe: two balls may claim the same slot
    grid[baseIdx + count + 1] = ballIdx;
}

Task: add a single ball to the correct grid cell. Workset: number of balls.
SLIDE 15

Verlet

GPU Verlet Fluid – Grid

__kernel void fillGrid( global float* balls, global unsigned int* grid )
{
    int ballIdx = get_global_id( 0 );
    int gx = balls[ballIdx * 6 + 0] * GRIDX;
    int gy = balls[ballIdx * 6 + 1] * GRIDY;
    if ((gx < 0) || (gy < 0) || (gx >= GRIDX) || (gy >= GRIDY)) return;
    int baseIdx = (gx + gy * GRIDX) * (BALLSPERCELL + 1);
    unsigned int count = atomic_inc( grid + baseIdx ); // returns the old count
    if (count < BALLSPERCELL) grid[baseIdx + count + 1] = ballIdx;
    else
    {
        // cell is full: relocate the ball and clamp the count
        balls[ballIdx * 6 + 1] = balls[ballIdx * 6 + 5] = 0.1f;
        grid[baseIdx] = BALLSPERCELL;
    }
}
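The pattern (atomic counter in slot [0], payload in slots [1..N]) can be exercised on the CPU with std::atomic; a sketch with illustrative names (Cell, CAPACITY are not from the slides):

```cpp
#include <atomic>
#include <cassert>

const int CAPACITY = 4;

// Append-to-cell: fetch_add returns the counter's old value, which doubles
// as the caller's insertion slot (the CPU analogue of atomic_inc in the
// kernel). On overflow the count is clamped, as in the kernel.
struct Cell
{
    std::atomic<unsigned int> count { 0 };
    unsigned int items[CAPACITY];
    bool add( unsigned int idx )
    {
        unsigned int slot = count.fetch_add( 1 ); // old value = my slot
        if (slot < CAPACITY) { items[slot] = idx; return true; }
        count = CAPACITY;                         // clamp on overflow
        return false;
    }
};
```

Because fetch_add is a single atomic read-modify-write, no two threads can receive the same slot, which is exactly why the racy `grid[baseIdx]++` version needed replacing.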

SLIDE 16

Verlet

GPU Verlet Fluid – Grid

__kernel void render( write_only image2d_t outimg, global float* balls, global unsigned int* grid )
{
    int column = get_global_id( 0 );
    int line = get_global_id( 1 );
    if ((column >= 800) || (line >= 480)) return;
    float2 uv = { (float)column / 800.0f, (float)line / 480.0f };
    // draw balls using grid
    int gx = uv.x * GRIDX;
    int gy = uv.y * GRIDY;
    int gx1 = max( 0, gx - 1 ), gx2 = min( GRIDX - 1, gx + 1 );
    int gy1 = max( 0, gy - 1 ), gy2 = min( GRIDY - 1, gy + 1 );
    ...

SLIDE 17

Verlet

GPU Verlet Fluid – Grid

    ...
    for( int y = gy1; y <= gy2; y++ ) for( int x = gx1; x <= gx2; x++ )
    {
        unsigned int baseIdx = (x + y * GRIDX) * (BALLSPERCELL + 1);
        unsigned int count = grid[baseIdx];
        for( int i = 0; i < count; i++ )
        {
            unsigned int ballIdx = grid[baseIdx + i + 1];
            float2 pos = { balls[ballIdx * 6], balls[ballIdx * 6 + 1] };
            float dist = length( pos - uv );
            if (dist > 0.01f) continue;
            write_imagef( outimg, (int2)(column, 479 - line), (float4)(1, 0, 0, 1) );
        }
    }
}

SLIDE 18

Verlet

GPU Verlet Fluid – Grid - Result


SLIDE 19

Verlet

GPU Verlet Fluid

Implementing the simulation

STAGE 3

SLIDE 20

Verlet

GPU Verlet Fluid – Simulation

__kernel void simulate1( global float* balls )
{
    int idx = get_global_id( 0 );
    float2 prevPos = { balls[idx * 6 + 0], balls[idx * 6 + 1] };
    float2 delta = { balls[idx * 6 + 0] - balls[idx * 6 + 4],             // x - prev x
                     balls[idx * 6 + 1] - balls[idx * 6 + 5] + 0.0002f }; // y - prev y, plus gravity
    float speed = length( delta );
    if (speed > 0.01f) delta = 0.01f * normalize( delta ); // clamp speed
    balls[idx * 6 + 0] += delta.x;
    balls[idx * 6 + 1] += delta.y;
    balls[idx * 6 + 4] = prevPos.x;
    balls[idx * 6 + 5] = prevPos.y;
}

SLIDE 21

Verlet

GPU Verlet Fluid – Simulation

__kernel void simulate2( global float* balls, global float* balls2, global unsigned int* grid )
{
    int cellIdx = get_global_id( 0 );
    int baseIdx = cellIdx * (BALLSPERCELL + 1);
    int count = grid[baseIdx];
    if (count == 0) return;
    int gx = cellIdx % GRIDX;
    int gy = cellIdx / GRIDX;
    // determine 3x3 block around current cell
    int gx1 = max( 0, gx - 1 ), gx2 = min( GRIDX - 1, gx + 1 );
    int gy1 = max( 0, gy - 1 ), gy2 = min( GRIDY - 1, gy + 1 );
    for( int i = 0; i < count; i++ )
    {

SLIDE 22

Verlet

GPU Verlet Fluid – Simulation

        // get active ball
        int idx1 = grid[baseIdx + i + 1];
        float2 ball1Pos = { balls[idx1 * 6 + 0], balls[idx1 * 6 + 1] };
        // evade other balls
        for( int y = gy1; y <= gy2; y++ ) for( int x = gx1; x <= gx2; x++ )
        {
            int baseIdx2 = (x + y * GRIDX) * (BALLSPERCELL + 1); // neighbor cell
            int count2 = min( (unsigned int)BALLSPERCELL, grid[baseIdx2] );
            for( int j = 0; j < count2; j++ )
            {
                int idx2 = grid[baseIdx2 + j + 1];
                if (idx2 != idx1)
                {
                    float2 ball2Pos = { balls2[idx2 * 6 + 0], balls2[idx2 * 6 + 1] };
                    ...

SLIDE 23

Verlet

GPU Verlet Fluid – Simulation


SLIDE 24

Verlet

GPU Verlet Fluid

What causes the poor performance?

  • Simulation handles one grid cell per thread
  • Grid cell workload is highly irregular
  • Do we even have enough grid cells?
SLIDE 25

Verlet

GPU Verlet Fluid – Take-away

GPGPU is a bit different:

  • We have ‘host’ and ‘device’ code
  • We need many small identical tasks
  • Each task has an ‘identity’ (1D, 2D or 3D index in the workset)
  • Some tasks may be outside the workset (check for this!)
  • Ideally, each of those tasks should do a similar amount of work (if, for)
  • The tasks run in parallel: mind concurrency issues! (atomic)
  • Data transfer from CPU to GPU is expensive (avoid this)

In this example, OpenCL directly plotted to an OpenGL texture (which is then drawn on a quad, using a shader). It is probably more efficient to let OpenCL prepare a vertex buffer for drawing point sprites.

SLIDE 26

Today’s Agenda:

  • Practical GPGPU: Verlet Fluid
  • GPGPU Algorithms
  • Optimizing GPU code
SLIDE 27

GPGPU Algorithms

Prefix Sum

The prefix sum (or cumulative sum) of a sequence of numbers is a second sequence of numbers consisting of the running totals of the input sequence:

Input: x₀, x₁, x₂
Output: x₀, x₀ + x₁, x₀ + x₁ + x₂ (inclusive) or 0, x₀, x₀ + x₁ (exclusive).

Example:

  input:     1  2  3  4  5  6
  inclusive: 1  3  6 10 15 21
  exclusive: 0  1  3  6 10 15

Here, addition is used; more generally we can use an arbitrary binary associative operator.

SLIDE 28

GPGPU Algorithms

Prefix Sum

In C++:

  out[0] = 0;
  for( int i = 1; i < n; i++ ) out[i] = in[i-1] + out[i-1];


SLIDE 29

GPGPU Algorithms

Prefix Sum

The prefix sum is used for compaction. Given: kernel K which may or may not produce output for further processing.

SLIDE 30

GPGPU Algorithms

Prefix Sum - Compaction

Given: kernel K which may or may not produce output for further processing.

  boolean array:        0 0 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 1 0 0
  exclusive prefix sum: 0 0 0 1 1 1 2 3 4 4 4 4 4 5 5 5 5 5 6 6

The scan gives us:

  • output array
  • output array size
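Sequentially, the whole compaction pipeline is only a handful of lines; a sketch (the function name is mine, not template API):

```cpp
#include <cassert>
#include <vector>

// Compaction: an exclusive prefix sum over the boolean array gives each
// surviving element its slot in the compacted output; the final scan value
// plus the final flag is the output array size.
std::vector<int> compact( const std::vector<int>& flags, const std::vector<int>& values )
{
    int n = (int)flags.size();
    std::vector<int> scan( n, 0 );
    for( int i = 1; i < n; i++ ) scan[i] = scan[i - 1] + flags[i - 1]; // exclusive scan
    std::vector<int> out( scan[n - 1] + flags[n - 1] );                // output array size
    for( int i = 0; i < n; i++ )
        if (flags[i]) out[scan[i]] = values[i];                        // scatter survivors
    return out;
}
```

On the GPU the scatter loop runs as one thread per element; the scan guarantees every survivor gets a unique slot, so no atomics are needed.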
SLIDE 31

GPGPU Algorithms

Prefix Sum

  out[0] = 0;
  for( int i = 1; i < n; i++ ) out[i] = in[i-1] + out[i-1];

In parallel:

  for( d = 1; d <= log₂ n; d++ )
      for all k in parallel do
          if (k >= 2^(d-1)) x[k] += x[k - 2^(d-1)]

  input:     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  after d=1: 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
  after d=2: 1 2 3 4 4 4 4 4 4 4 4 4 4 4 4 4
  after d=3: 1 2 3 4 5 6 7 8 8 8 8 8 8 8 8 8

SLIDE 32

GPGPU Algorithms

Prefix Sum

  out[0] = 0;
  for( int i = 1; i < n; i++ ) out[i] = in[i-1] + out[i-1];

In parallel:

  for( d = 1; d <= log₂ n; d++ )
      for all k in parallel do
          if (k >= 2^(d-1)) x[k] += x[k - 2^(d-1)]

Notes:

  • The scan happens in-place. This is only correct if we have 32 input elements, and the scan is done in a single warp. Otherwise we need to double buffer for correct results.
  • Time complexity of the algorithm is O(n log n). For large n, it is not work-efficient.
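The double-buffered variant mentioned in the notes looks like this on the CPU (a sketch; the template's actual ParallelScan may be implemented differently):

```cpp
#include <cassert>
#include <vector>

// Hillis-Steele inclusive scan. The inner loop models "for all k in
// parallel"; double buffering (src/dst plus a swap) avoids the
// read-after-write hazard that a single warp hides for 32 elements.
std::vector<int> inclusiveScan( std::vector<int> src )
{
    int n = (int)src.size();
    std::vector<int> dst( n );
    for( int d = 1; d < n; d *= 2 )   // offset doubles each pass: 1, 2, 4, ...
    {
        for( int k = 0; k < n; k++ )
            dst[k] = (k >= d) ? src[k] + src[k - d] : src[k];
        std::swap( src, dst );
    }
    return src;
}
```

Each of the ⌈log₂ n⌉ passes does up to n additions, which is where the O(n log n) work bound comes from.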

SLIDE 33

GPGPU Algorithms

Prefix Sum

Phase 1: “Up-sweep” (reduce):

  input: 1 1 1 1 1 1 1 1
  d=1:   1 2 1 2 1 2 1 2
  d=2:   1 2 1 4 1 2 1 4
  d=3:   1 2 1 4 1 2 1 8

Phase 2: “Down-sweep”:

  clear last: 1 2 1 4 1 2 1 0
  d=3:        1 2 1 0 1 2 1 4
  d=2:        1 0 1 2 1 4 1 6
  d=1:        0 1 2 3 4 5 6 7

Notes:

  • A scan consists of 2(n-1) additions and n-1 swaps.

More information: GPU Gems 3, Chapter 39, “Parallel Prefix Sum (Scan) with CUDA”, Mark Harris.
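The two phases can be written out directly; a sketch for power-of-two n, following the structure of the chapter referenced above:

```cpp
#include <cassert>
#include <vector>

// Work-efficient (Blelloch) exclusive scan, in-place. The up-sweep builds a
// reduction tree; the down-sweep pushes partial sums back to the leaves.
// 2(n-1) additions and n-1 swaps in total; n must be a power of two here.
void blellochScan( std::vector<int>& x )
{
    int n = (int)x.size();
    for( int d = 1; d < n; d *= 2 )              // phase 1: up-sweep
        for( int i = 2 * d - 1; i < n; i += 2 * d )
            x[i] += x[i - d];
    x[n - 1] = 0;                                // clear the root
    for( int d = n / 2; d >= 1; d /= 2 )         // phase 2: down-sweep
        for( int i = 2 * d - 1; i < n; i += 2 * d )
        {
            int t = x[i - d];
            x[i - d] = x[i];                     // swap
            x[i] += t;                           // add
        }
}
```

On eight ones this reproduces the rows shown above, ending in the exclusive prefix sum 0 1 2 3 4 5 6 7.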

SLIDE 34

GPGPU Algorithms

Prefix Sum

You can find an implementation of the prefix sum in the OpenCL template:

  cl_int Buffer::ParallelScan()

This replaces the contents of a buffer with the prefix sum of the same buffer.

SLIDE 35

GPGPU Algorithms

GPU Sorting: Selection Sort

__kernel void Sort( __global int* in, __global int* out )
{
    int i = get_global_id( 0 );
    int n = get_global_size( 0 );
    int iKey = in[i];
    // compute position of in[i] in output
    int pos = 0;
    for( int j = 0; j < n; j++ )
    {
        int jKey = in[j]; // broadcasted
        bool smaller = (jKey < iKey) || (jKey == iKey && j < i);
        pos += (smaller) ? 1 : 0;
    }
    out[pos] = iKey;
}
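Stripped of the OpenCL qualifiers, the kernel's logic is easy to check sequentially (a sketch; rankSort is my name for it):

```cpp
#include <cassert>
#include <vector>

// Rank sort: every element counts how many keys must precede it; ties are
// broken by original index, so output positions are unique and the sort is
// stable. O(n^2) work, but every "thread" is fully independent.
std::vector<int> rankSort( const std::vector<int>& in )
{
    int n = (int)in.size();
    std::vector<int> out( n );
    for( int i = 0; i < n; i++ )    // one GPU thread per element
    {
        int pos = 0;
        for( int j = 0; j < n; j++ )
            if (in[j] < in[i] || (in[j] == in[i] && j < i)) pos++;
        out[pos] = in[i];
    }
    return out;
}
```

The tie-break on the original index is what makes the scatter collision-free: two equal keys never compute the same position.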

SLIDE 36

GPGPU Algorithms

GPU Sorting


SLIDE 37

GPGPU Algorithms

GPU Sorting

Size: number of comparisons (in this case: 5 + 4 + 3 + 2 + 1 = 15)
Depth: number of sequential steps (in this case: 9)


SLIDE 38

GPGPU Algorithms

GPU Sorting


SLIDE 39

GPGPU Algorithms

GPU Sorting

You can find an implementation of the bitonic sort in the OpenCL template:

  cl_int Buffer::ParallelSort()

This replaces the contents of a buffer with the sorted values.

SLIDE 40

GPGPU Algorithms


Take-away:

GPGPU requires massive parallelism. Algorithms that do not exhibit this need to be replaced. The parallel scan is an important ingredient that serves as a building block within larger algorithms, or as glue between kernels.

SLIDE 41

Today’s Agenda:

  • Practical GPGPU: Verlet Fluid
  • GPGPU Algorithms
  • Optimizing GPU code
SLIDE 42

Optimizing GPGPU


SLIDE 43

Optimizing GPGPU


SLIDE 44

Optimizing GPGPU


SLIDE 45

Optimizing GPGPU


1. Optimize memory usage
   • Read data from global memory once
   • Use local memory when possible
   • Careful: reading the same global address from 32 threads is not a good idea!
2. Make sure there is enough work to hide latency
   • On AMD: use multiples of 64 threads (called a ‘wavefront’)
   • Tweak manually for performance, ideally per vendor / device
3. Minimize the number of host-to-device transfers, then their size
4. Minimize the number of kernel invocations

Reading the same global address from all threads:

  temp = input[3]; // input is in global mem

Instead, use:

  if (get_local_id( 0 ) == 0) local = input[3];
  barrier( CLK_LOCAL_MEM_FENCE );
  temp = local;

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide

Faster OpenCL

SLIDE 46

Optimizing GPGPU


Faster OpenCL

Smaller things:

  • Use float4 whenever possible
  • Use predication rather than control flow
  • Bypass short-circuiting
  • Remove conditional code
  • AOS vs SOA performance
  • Reducing atomics
  • Reduced precision math
  • Pinned memory

Predication:

  if (A > B) C += D; else C -= D;

Replace this with:

  int factor = (A > B) ? 1 : -1;
  C += factor * D;

Removing conditional code:

  if (x == 1) r = 0.5;
  if (x == 2) r = 1.0;

becomes

  r = select( r, 0.5, x == 1 );
  r = select( r, 1.0, x == 2 );

Bypassing short-circuiting:

  if (a && b && c && d) { … }

becomes

  bool cond = a && b && c && d;
  if (cond) { … }

Reduced precision math: native_log, native_exp, native_sqrt, native_sin, native_pow, …

Pinned memory:

  pinned = clCreateBuffer( Kernel::GetContext(),
      CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY,
      sizeof( myData ), 0, 0 );
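The predication rewrite behaves identically to the branch; a quick CPU check in plain C++ (function names are illustrative):

```cpp
#include <cassert>

// Branch version and predicated version of "if (A > B) C += D; else C -= D;".
// On a GPU the second form lets all threads of a warp follow the same path.
int withBranch( int A, int B, int C, int D )
{
    if (A > B) C += D; else C -= D;
    return C;
}

int withPredication( int A, int B, int C, int D )
{
    int factor = (A > B) ? 1 : -1;  // compiles to a select / conditional move
    return C + factor * D;
}
```

The ternary still looks like a branch in the source, but it maps to a single select instruction rather than divergent control flow.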

SLIDE 47

Today’s Agenda:

  • Practical GPGPU: Verlet Fluid
  • GPGPU Algorithms
  • Optimizing GPU code
SLIDE 48

/INFOMOV/ END of “GPGPU (2)”

next lecture: Presentations