RAMA HOETZLEIN | Graphics Research Engineer | SIGGRAPH 2013



SLIDE 1

RAMA HOETZLEIN

Graphics Research Engineer | SIGGRAPH 2013

SLIDE 2

Outline

Part 1 – CUDA Best Practice Strategies: Assess, Parallelize, Optimize, Deploy
Part 2 – CUDA Optimization: Hardware, Occupancy, Divergence, Bottlenecks, Atomic Ops
(Difficulty increases over the ~50-minute talk.)

SLIDE 3

Part #1 – Best Practices: Strategies

SLIDE 4

Assess Parallelize Optimize Deploy

APOD: A Systematic Path to Performance

SLIDE 5

Assess

  • Know your application problem
  • Know your hardware capabilities
  • Determine which aspects of the problem are best suited to parallelization; identify “hotspots”
  • Use profiling tools to find critical bottlenecks in CPU code

SLIDE 6

  • NVIDIA CUDA-MEMCHECK for Linux & Mac
  • NVIDIA Nsight Eclipse & Visual Studio Editions
  • Allinea DDT with CUDA Distributed Debugging Tool
  • TotalView for CUDA for Linux Clusters
  • NVIDIA CUDA-GDB for Linux & Mac

Profiling and Debugging Solutions

http://developer.nvidia.com/nsight

SLIDE 7

Assess

  • 1. Know your hardware!

                     Cores   Gflops   GB/s
  CPU Core i7-3770       4      108     25
  GeForce GTX480       480     1345    177
  Quadro K5000        1536     2168    172
  Tesla K20X          2688     3950    250

threads run in parallel on many cores

SLIDE 8

Assess

Practical Example: Fluid Simulation

SLIDE 9

Assess

  • 2. Know your problem!

Insert into Accel Grid Compute Forces Compute Pressures Integrate

SLIDE 10

Assess

  • 2. Know your problem!

Insert into Accel Grid Compute Forces Compute Pressures Integrate

Search for neighboring particles (NNS). Likely to be the slowest part of the code.

CPU version: O(n^2) worst case; O(n k) with a spatial grid lookup.

SLIDE 11

Assess

  • 3. Determine metrics

Time: standardize your units (avoid using fps). Consider time to complete task, time per frame, and time per sub-task, e.g. milliseconds.
Performance: measures the overall ability to do work. Choose a reasonable metric, e.g. image processing: pixels/sec. A combination of algorithm efficiency and hardware, e.g. particles/second == particles/op * ops/second.
Efficiency: normalizes performance by dividing by hardware Gflops. Measures the capability of the algorithm regardless of hardware, e.g. (particles/second) / Gflops == particles/Gflop.
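These definitions reduce to a few divisions; a minimal host-side sketch (the sample numbers are the fluid-simulation figures used later in this talk, and the 2168-Gflop device is an assumption chosen to match the quoted efficiency):

```cuda
#include <cstdio>

// Sketch: computing the metrics for one measured frame.
// Sample numbers: 524,288 particles, 1300 ms/frame, ~2168 Gflops hardware (assumed).
int main()
{
    double particles = 524288.0;
    double frame_ms  = 1300.0;    // measured time per frame (standardized unit: ms)
    double hw_gflops = 2168.0;    // peak Gflops of the device

    double perf  = particles / (frame_ms / 1000.0);  // performance: particles/sec
    double effic = perf / hw_gflops;                 // efficiency: particles/sec/Gflop

    printf( "Perf: %.0f p/sec   Effic: %.0f p/s/Gf\n", perf, effic );
    return 0;
}
```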

SLIDE 12

Assess

  • 4. Identify hotspots

One frame, 524,288 particles
Total: 1300 ms / frame
Performance = 403,289 particles / sec
Efficiency = 186 p/s/Gf

SLIDE 13

Assess

  • 4. Identify hotspots

524,288 particles: Insert 7 ms, Pressure 480 ms, Force 788 ms, Advance 36 ms.
Pressure and Force are an order of magnitude greater than the other steps.
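Per-stage timings like these are typically captured with CUDA events rather than CPU timers; a hedged sketch (insertParticles and its arguments are placeholders for one simulation stage):

```cuda
// Time one pipeline stage on the GPU with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start );
insertParticles<<< grid, tblk >>>( d_particles, numParticles );  // stage under test
cudaEventRecord( stop );
cudaEventSynchronize( stop );               // wait until the stage has finished

float ms = 0.0f;
cudaEventElapsedTime( &ms, start, stop );   // elapsed GPU time in milliseconds
printf( "Insert: %.1f ms\n", ms );

cudaEventDestroy( start );
cudaEventDestroy( stop );
```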
SLIDE 14

Parallelize

  • Determine amount of crosstalk in the problem
  • Identify parallel method suitable to problem
  • Translate CPU algorithms to GPU
SLIDE 15

Parallelize

  • 1. Crosstalk and Coherency

determine ease of parallelism

coherent → incoherent: Color Grading, Simple Particles, Image Blur, N-Body Problem, Fluid Simulation, Raytracing

SLIDE 16

Parallelize

  • 2. Design parallel algorithm

Example: Fluid Simulation Key Observations

  • 1. Particles are dynamic
  • 2. Particles become incoherent in memory (mix) as they move
  • 3. Radix-Sort can keep coherency. Radix = fast parallel sort.
  • 4. Do Neighbor Search on coherent particles.

Assign one particle per thread. Keep coherent by sorting each frame.
Many resources available: CUDA SDK Samples, Developer Forums

developer.nvidia.com/gpu-computing-sdk devtalk.nvidia.com
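The sort-each-frame step can be sketched with Thrust, which dispatches a radix sort for integer keys; the buffer names here are assumptions (the CUDA SDK particles sample follows the same pattern):

```cuda
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Each frame: hash every particle to its acceleration-grid cell, then sort
// particles by cell hash so spatial neighbors become memory neighbors.
void sortParticlesByCell( unsigned int* d_cellHash,   // grid-cell id per particle
                          unsigned int* d_index,      // particle index, permuted by sort
                          int numParticles )
{
    thrust::device_ptr<unsigned int> keys( d_cellHash );
    thrust::device_ptr<unsigned int> vals( d_index );
    thrust::sort_by_key( keys, keys + numParticles, vals );  // radix sort on the GPU
}
```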

SLIDE 17

Optimize

  • 1. Compare GPU to CPU

One frame, 524,288 particles
CPU Time: 1300 ms / frame        GPU Time: 90 ms / frame (14x faster)
CPU Perf: 403,289 p/sec          GPU Perf: 5,825,422 p/sec
CPU Effic: 3734 p/s/Gf           GPU Effic: 2687 p/s/Gf

SLIDE 18

Optimize

  • 2. Memory Architecture

Kepler Memory Hierarchy

Each SMX (SM-0 … SM-N) contains: Registers, L1 cache / Shared Memory, and a Read-only cache.
All SMX units share the L2 cache and Global Memory.

L1 Cache / Registers: 8000 GB/s
Shared Memory (per SMX): 64k, 2000 GB/s
Read-only / Texture Memory: 48k (~100 cycles)
L2 Cache: 1.5 MB (GK110)
Global Memory (inc. Local Memory): 170 GB/s (~400 cycles)

SLIDE 19

Optimize

  • 3. Keep Optimizing!

What is occupancy?
Why is it 56% for Forces? Why is shared memory not used?
Shared mem: 100x faster

SLIDE 20

Deploy

  • Multiple GPUs: cudaGetDeviceCount()
  • Error handling: cudaGetErrorString()
  • NVML: cluster management
  • nvidia-smi: system monitoring

Once you have a working, efficient GPU solution…
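Both deployment points can be sketched together; the CUDA_CHECK macro is a common convention, not something from the talk:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported where they occur.
#define CUDA_CHECK(call) do {                                     \
    cudaError_t err = (call);                                     \
    if ( err != cudaSuccess )                                     \
        fprintf( stderr, "CUDA error: %s at %s:%d\n",             \
                 cudaGetErrorString( err ), __FILE__, __LINE__ ); \
} while (0)

int main()
{
    int count = 0;
    CUDA_CHECK( cudaGetDeviceCount( &count ) );     // enumerate available GPUs
    for ( int d = 0; d < count; d++ ) {
        cudaDeviceProp prop;
        CUDA_CHECK( cudaGetDeviceProperties( &prop, d ) );
        printf( "GPU %d: %s\n", d, prop.name );
    }
    return 0;
}
```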

SLIDE 21

Part #2 – Best Practices: CUDA Optimization

SLIDE 22

Hardware Architecture: SimpleGPU, a visual simplification of the GPU with all the essential components, to help visualize optimization issues.

SLIDE 23

Hardware Architecture

                            Fermi            Kepler
                         GF100   GF104    GK104   GK110
  Threads / Warp            32      32       32      32
  Registers / Thread        63      63       63     255
  Threads / Threadblock   1024    1024     1024    1024
  Shared Memory            48k     48k      48k     48k
  Cores / MP                32      32      192     192

  Local Memory: variable (uses GMEM)
  Global Memory: GTX 480 = 1.5 GB, Titan / K20 = 6 GB

Know your hardware.

SLIDE 24

Execution Model

SMX = Streaming Multiprocessor
  • Runs a threadblock (multiple warps)
  • Shares shared memory
  • Provides registers to each thread

All threads in a warp are launched in parallel. Instructions and memory reads are executed in parallel within a warp.
Threads are virtual (millions); cores are physical (limited). Many virtual threads are scheduled to run on the physical cores.

SLIDE 25

Occupancy

  • 1. Maximize use of all SMX on the GPU
  • 2. Maximize use of threads in a warp per SMX
SLIDE 26

Occupancy #1 – Maximize GPU Usage

C code:
    for (i = 0; i < 1024; i++)
        y = data[i] * data[i];

CUDA code:
    kernel<<< grid, tblk >>>( data )
    {
        int i = threadIdx.x;
        int y = data[i] * data[i];   // one thread per element
    }

SLIDE 27

Occupancy #1 - Maximize GPU Usage

dim3 tblk( 16, 16 );   // 256 threads
dim3 grid( 1, 1 );
kernel<<< grid, tblk >>>( my_img, grey );
“Hey, great, 256 steps in parallel!”

SLIDE 28

Occupancy #1 - Maximize GPU Usage

dim3 tblk( 16, 16 );
dim3 grid( 2, 1 );     // 2x work
kernel<<< grid, tblk >>>( my_img, grey );
“Wow, double the calculations!” It takes the same amount of time! Most of the GPU is just sitting there.

SLIDE 29

Occupancy #1 - Maximize GPU Usage

dim3 tblk( 16, 16 );
dim3 grid( 64, 64 );
kernel<<< grid, tblk >>>( my_img, grey );
Now we're doing ~1,024 in parallel, AND giving the GPU enough work to stay busy. Total: 1,048,576 threads scheduled. Yes!

#1. Maximize use of SMX in GPU
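A fuller version of the launch above, with the standard ceil-divide so image sizes that are not multiples of the block size are still covered (the per-pixel body and parameter names are placeholders):

```cuda
__global__ void kernel( const unsigned char* my_img, unsigned char* grey, int w, int h )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ( x >= w || y >= h ) return;        // guard tail threads past the image edge
    grey[ y*w + x ] = my_img[ y*w + x ];   // placeholder per-pixel work
}

void launch( const unsigned char* my_img, unsigned char* grey, int w, int h )
{
    dim3 tblk( 16, 16 );                     // 256 threads per block
    dim3 grid( (w + tblk.x - 1) / tblk.x,    // ceil-divide: enough blocks to
               (h + tblk.y - 1) / tblk.y );  // cover every pixel and fill the SMX
    kernel<<< grid, tblk >>>( my_img, grey, w, h );
}
```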

SLIDE 30

Occupancy #2 - Threadblocks

dim3 tblk( 3, 25 );
dim3 grid( 64, 64 );
kernel<<< grid, tblk >>>( my_img, grey );
3 * 10 = 30 (not 32)  >>  non-full warp
3 * 25 = 75 (not 96)  >>  non-full threadblock (tails)
Irregular threadblock dimensions cause low occupancy in each warp, and tail threads, even though the grid is large.

SLIDE 31

Occupancy #2 - Threadblocks

GPU supports: 32 threads / warp, 1024 threads / threadblock.
Before: dim3 tblk( 16, 1 ) gives only 16 threads per threadblock.
After: dim3 tblk( 32, 32 ) with dim3 grid( 64, 64 ) gives 1024 threads per threadblock; now threadblocks are full.

#2. Maximize use of threadblocks
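The rule of thumb behind this slide: pick threadblock sizes that are multiples of the 32-thread warp. A small sketch of the two configurations:

```cuda
// tblk(3,25) = 75 threads  -> 3 warps, the last with only 11 of 32 lanes busy.
// tblk(32,8) = 256 threads -> 8 completely full warps.
dim3 grid( 64, 64 );
dim3 bad ( 3, 25 );    // irregular: partially empty warps in every block
dim3 good( 32, 8 );    // warp-aligned: every warp full
kernel<<< grid, good >>>( my_img, grey );
```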

SLIDE 32

Execution Divergence

  • 1. Reduce or eliminate the use of conditionals
  • 2. Maximize computational ops

(over conditional ops and memory ops)

SLIDE 33

Execution Divergence

kernel<<< grid, tblk >>>( in, out, param )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( in[i+1] > 0 ) {
        out[i] = pow( in[i], in[i+1] );
    } else {
        out[i] = 1;
    }
}
The code makes sure the value is in range. Within a warp, the two sides of the branch serialize and cores sit idle; Warp #2 must wait for all of Warp #1 to finish. Not an issue across SMX.

SLIDE 34

Execution Divergence

kernel<<< grid, tblk >>>( in, out, param )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = pow( in[i], in[i+1] );
}
Do validation on input data before launching the kernel. With no branch, the next warp launches sooner.

SLIDE 35

Execution Divergence

Image processing example: conditionally perform a power, multiply, and offset on each pixel.
kernel<<< grid, tblk >>>( in, out, param )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = in[i];
    if ( isEnabled[0] ) val = pow( val, param[0] );
    if ( isEnabled[1] ) val = val * param[1];
    if ( isEnabled[2] ) val = val + param[2];
    out[i] = val;
}

SLIDE 36

Execution Divergence

Timeline per thread: 3x if, 3x store, 3x isEnabled load, for a single pow, multiply, and add. Warp #2 must wait for all of Warp #1 to finish. Most of the time is spent loading and checking parameters, rather than computing.

SLIDE 37

Execution Divergence

Solution: use parameters to set identity values when not enabled, e.g. val = val * 1; val = val + 0;
kernel<<< grid, tblk >>>( in, out, param )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = pow( in[i], param[0] ) * param[1] + param[2];
}
#1. Avoided global load of isEnabled
#2. Avoided conditional checks
#3. Avoided register use (val =)
Now the timeline per thread is just pow, multiply, add.

Maximize use of the GPU for calculation. Memory reads and conditionals are not always “useful” work.
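The host-side half of this trick, as I read it: fold the isEnabled flags into the parameters before launch, so the kernel body stays pure math (the setup code and variable names are a sketch, not from the slides):

```cuda
// Host side: substitute identity values for disabled stages, then launch
// the branch-free kernel.
float param[3];
param[0] = isEnabled[0] ? exponent : 1.0f;   // pow(x, 1) == x
param[1] = isEnabled[1] ? scale    : 1.0f;   // x * 1    == x
param[2] = isEnabled[2] ? offset   : 0.0f;   // x + 0    == x
cudaMemcpy( d_param, param, sizeof(param), cudaMemcpyHostToDevice );
kernel<<< grid, tblk >>>( d_in, d_out, d_param );
```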

SLIDE 38

Bottleneck: Memory or Instruction?

  • 1. Are you memory-limited or instruction-limited?
  • 2. If memory-limited, seek to reduce bandwidth.
  • 3. If instruction-limited, seek to remove unnecessary calculations.

SLIDE 39

Theoretical Limits

MEMORY: theoretical memory throughput: 177 GB/s (Tesla M2090)
INSTRUCTIONS: refer to the CUDA Programming Guide for ops/cycle. Theoretical instruction throughput: 655 Ginst/sec (Tesla M2090)
BALANCED RATIO: this is what we want to optimize against: 3.76 : 1. A higher kernel ratio = instruction-bound; a lower ratio = memory-bound.
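The balanced ratio is just the two theoretical peaks divided. Assuming the quoted figures (and noting that 3.76:1 matches an instruction peak of roughly 665 Ginst/s against 177 GB/s):

```latex
\text{balanced ratio} \;=\; \frac{\text{peak instruction throughput}}{\text{peak memory throughput}}
\;\approx\; \frac{665\ \text{Ginst/s}}{177\ \text{GB/s}} \;\approx\; 3.76\ \text{instructions per byte}
```

A kernel issuing more instructions per byte than this is instruction-bound; fewer, and it is memory-bound.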

SLIDE 40

Memory Limited

kernel<<< grid, dim3(3,25,1) >>>( my_img, grey )
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int a = img[px-1];
    int b = img[px];       // filter operation
    int c = img[px+1];
    int d = a + b + c;     // combine pixel values
}
LOADS: 4x 32-bit global loads per thread = 16 bytes / thread
OPS: 3x adds per thread
Ratio = 16:3 = 5.3:1, greater than 4:1  →  memory bound

SLIDE 41

Memory Limited – Soln #1. Reduce data size

Reduce the data size: load 8-bit pixels instead of 32-bit.
kernel<<< grid, dim3(3,25,1) >>>( my_img, grey )
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned char a = img[px-1];
    unsigned char b = img[px];
    unsigned char c = img[px+1];
    int d = a + b + c;
}
LOADS: 4x 8-bit global loads per thread = 4 bytes / thread
OPS: 3x adds per thread
Ratio = 4:3 = 1.3:1, less than 4:1  →  no longer memory bound

SLIDE 42

Memory Limited – Soln #2. Use Shared memory

Shared memory is 100x faster than global.
kernel<<< grid, dim3(3,25,1) >>>( my_img, grey )
{
    int i  = blockIdx.x * blockDim.x + threadIdx.x;
    int tx = threadIdx.x;
    __shared__ unsigned char pixels[32];
    if ( tx < 32 ) pixels[tx] = img[i];     // load shared memory
    __syncthreads();
    // calculate using shared memory (edge handling omitted)
    img[i] = pixels[tx-1] + pixels[tx] + pixels[tx+1];
}
LOADS: 1 byte per thread (each thread loads one pixel)
OPS: 3x adds per thread
Ratio = 1:3 = 0.3:1, much less than 4:1  →  not memory bound
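The slide's sketch omits the tile edges (pixels[tx-1] underruns at tx = 0). A fuller 1D version with halo loads, assuming a 32-thread block (names and clamping policy are my choices):

```cuda
__global__ void blur3( const unsigned char* img, unsigned char* out, int n )
{
    __shared__ int tile[ 32 + 2 ];            // block width plus a 1-pixel halo per side
    int i  = blockIdx.x * blockDim.x + threadIdx.x;
    int tx = threadIdx.x + 1;                 // shift right by one for the left halo

    if ( i < n ) tile[tx] = img[i];
    if ( threadIdx.x == 0 )                   // left halo, clamped at the image edge
        tile[0] = img[ max( i - 1, 0 ) ];
    if ( threadIdx.x == blockDim.x - 1 )      // right halo, clamped at the image edge
        tile[ blockDim.x + 1 ] = img[ min( i + 1, n - 1 ) ];
    __syncthreads();

    if ( i < n )
        out[i] = ( tile[tx-1] + tile[tx] + tile[tx+1] ) / 3;
}
// Launch with blockDim.x == 32 to match the tile size.
```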

SLIDE 43

Coalescing and Memory Access

Scenario:

Warp requests 32 aligned, consecutive 4-byte words. The warp needs 128 bytes, moved together across the bus.

SLIDE 44

Scenario:

Warp requests 32 aligned, permuted 4-byte words. Perfectly coalesced.

Coalescing and Memory Access

SLIDE 45

Scenario:

Warp requests 32 misaligned, consecutive 4-byte words. At most 160 bytes move across the bus. Not fully coalesced.

Coalescing and Memory Access

SLIDE 46

Scenario:

All threads in a warp request the same 4-byte word. 32 bytes move across the bus. Bus utilization: 12.5%. Access conflict.

Coalescing and Memory Access

SLIDE 47

Scenario:

Warp requests 32 scattered 4-byte words, outside 128 bytes. Non-coalesced.

Coalescing and Memory Access
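One common source of these patterns is data layout: array-of-structs produces the scattered scenario, struct-of-arrays the perfectly coalesced one (the particle layout below is illustrative):

```cuda
// Array-of-structs: thread i reads p[i].x, so consecutive threads touch
// addresses 16 bytes apart -- a warp's 32 requests spread over four
// 128-byte lines (partially coalesced at best).
struct Particle { float x, y, z, mass; };
__global__ void readAoS( const Particle* p, float* out )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = p[i].x;
}

// Struct-of-arrays: thread i reads x[i], so a warp requests 32 aligned,
// consecutive 4-byte words -- one 128-byte transaction, fully coalesced.
__global__ void readSoA( const float* x, float* out )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = x[i];
}
```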

SLIDE 48

Instruction Limited

Assuming that data is loaded efficiently (not memory-bound):
#1. Are you performing unnecessary calculations?
#2. Can the problem/algorithm be reformulated?
#3. Use faster ops: can you use float instead of double?
Generally, this is a good problem to have: it means most of your effort is going toward useful work.
  • Memory ops are necessary, but not useful work
  • Conditionals may be necessary, but often are not
SLIDE 49

Instruction Limited

kernel<<< grid, dim3(3,25,1) >>>( my_img, grey )
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    float a = img[px];
    float d = pow( a, 3.2f ) * sin( a ) + a*a;
}
LOADS: 1x 32-bit global load per thread = 4 bytes / thread
OPS: 3x add/mul + (2x pow/sin at 6 ops each) = 15 ops
Ratio = 4:15 = 0.27:1, much less than 4:1  →  instruction bound

32-bit float add/mul: 192 per clock cycle = 1 op
32-bit float sqrt/pow/sin: 32 per clock cycle = 6 ops

CUDA Programming Guide (Section 5.4.1)

SLIDE 50

Instruction Limited


Are your math operations optimal? Common approaches:

  • Avoid unnecessary calculations
  • CUDA fast, half-precision ops.
  • Lookup tables, constant memory
  • Algorithmic…
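One concrete instance of “CUDA fast ops”: the single-precision intrinsics trade accuracy for throughput; whether that trade is acceptable is application-dependent (the kernel shape follows the slide's example):

```cuda
__global__ void fastKernel( const float* img, float* out, int n )
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    if ( px >= n ) return;
    float a = img[px];
    // __powf / __sinf map to the special-function units: far fewer
    // instructions than the precise pow()/sin(), at reduced accuracy.
    out[px] = __powf( a, 3.2f ) * __sinf( a ) + a*a;
}
// Alternatively, compile everything with: nvcc -use_fast_math
```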
SLIDE 51

Assess Parallelize Optimize Deploy

Assess: Know your hardware, Find hotspots

SLIDE 52

Assess Parallelize Optimize Deploy

Assess: Know your hardware, Find hotspots, Profile CPU and GPU
Parallelize: Write CUDA code

SLIDE 53

Assess Parallelize Optimize Deploy

Assess: Know your hardware, Find hotspots, Profile CPU and GPU
Parallelize: Write CUDA code
Optimize: Occupancy, Branching, Bandwidth, Coalescing

SLIDE 54

Assess Parallelize Optimize Deploy

Assess: Know your hardware, Find hotspots, Profile CPU and GPU
Parallelize: Write CUDA code
Optimize: Occupancy, Branching, Bandwidth, Coalescing
Deploy: Multiple Devices, Error Handling, Distribute!

SLIDE 55

Thank You !

Further Resources:

CUDA Best Practices: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
CUDA SDK Samples: https://developer.nvidia.com/gpu-computing-sdk
GPU Technology Conference (GTC), featured talks: http://www.gputechconf.com
Nsight Profiler (free): https://developer.nvidia.com/nvidia-nsight-visual-studio-edition
HPC Computing with CUDA: http://www.nvidia.com/object/SC09_Tutorial.html

RAMA HOETZLEIN

Graphics Research Engineer | SIGGRAPH 2013

RAMA HOETZLEIN rhoetzlein@nvidia.com