POWER EFFICIENT VISUAL COMPUTING ON MOBILE PLATFORMS
BRANT ZHAO, NVIDIA | MAX LV, NVIDIA

- Performance & Energy Efficiency
- Power Efficient GPU Programming: Case Studies & Findings
Case study #1: Image Pyramid Blending
[Figure: two input images combined into one blended result]
Reconstruct, Up-sample and Add
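For reference, the blend the figure depicts is the standard Laplacian pyramid formulation (my notation, not from the slides): each Laplacian level of the two images is mixed by the corresponding Gaussian level of the mask, and the result is reconstructed coarse-to-fine by up-sampling and adding:

$$L_{blend}^{k} = G_{mask}^{k} \cdot L_{left}^{k} + \left(1 - G_{mask}^{k}\right) \cdot L_{right}^{k}, \qquad I^{k} = \mathrm{upsample}\!\left(I^{k+1}\right) + L_{blend}^{k}$$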
Image Pyramid Blending
- A naïve CUDA implementation
[Pipeline diagram: CPU/GPU activity and clock frequency over time]
CPU: cudaMalloc for pyramids → GPU: Create Laplacian Pyramids for left image
CPU: cudaMalloc for pyramids → GPU: Create Laplacian Pyramids for right image
CPU: cudaMalloc for pyramids → GPU: Create Gaussian Pyramids for mask image
CPU: cudaMalloc for pyramids → GPU: Blend Laplacian Pyramids
GPU: Reconstruct Blended Image
The work ping-pongs between CPU and GPU, so neither processor's clock can settle at an efficient operating point.
Image Pyramid Blending
- Power optimized: Avoid CPU<->GPU interleaving
[Pipeline diagram: all cudaMalloc calls batched before any GPU work]
CPU: cudaMalloc for all pyramids (left, right, mask, blend)
GPU: Create Laplacian Pyramids for left image → Create Laplacian Pyramids for right image → Create Gaussian Pyramids for mask image → Blend Laplacian Pyramids → Reconstruct Blended Image
One CPU burst followed by one uninterrupted GPU burst, as sketched below.
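The restructuring boils down to hoisting every cudaMalloc ahead of the first kernel launch. A minimal host-side sketch of that pattern (the pyramid-building helpers, LEVELS, and levelBytes are hypothetical stand-ins I introduce for illustration, not the presenters' API):

```cuda
#include <cuda_runtime.h>

// Hypothetical helpers; names are illustrative only.
size_t levelBytes(int level);
void buildLaplacianPyramid(float** dst, const float* srcImage);
void buildGaussianPyramid(float** dst, const float* srcImage);
void blendPyramids(float** dst, float** left, float** right, float** mask);
void reconstruct(float* dstImage, float** blend);

void blendImages(const float* left, const float* right, const float* mask,
                 float* out)
{
    const int LEVELS = 5;  // assumed pyramid depth
    float *dLeft[LEVELS], *dRight[LEVELS], *dMask[LEVELS], *dBlend[LEVELS];

    // One CPU burst: every allocation happens before any GPU work,
    // instead of a cudaMalloc before each pyramid as in the naive version.
    for (int l = 0; l < LEVELS; ++l) {
        size_t bytes = levelBytes(l);
        cudaMalloc(&dLeft[l],  bytes);
        cudaMalloc(&dRight[l], bytes);
        cudaMalloc(&dMask[l],  bytes);
        cudaMalloc(&dBlend[l], bytes);
    }

    // One GPU burst: kernels launch back to back with no CPU work between,
    // so neither processor keeps ramping its frequency up and down.
    buildLaplacianPyramid(dLeft,  left);
    buildLaplacianPyramid(dRight, right);
    buildGaussianPyramid(dMask,   mask);
    blendPyramids(dBlend, dLeft, dRight, dMask);
    reconstruct(out, dBlend);
}
```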
Image Pyramid Blending
- Perf/Watt comparison
[Chart: normalized performance vs. normalized CPU+GPU power, comparing "CPU GPU interleaving" against "NOT interleaving"]
Case study #2: 2D Convolution
[Figure: a simple convolution example. Input pixels 1 2 1 2 2 3 2 1 10 1 are combined with kernel weights 0.25 and 0.75; the weighted sum of neighbouring pixels produces output pixel 8.]
2D Convolution
- Basic operations for 2 output pixels
- 9 packed FP16 MAD
[Figure: for two adjacent output pixels, the nine pairs of input pixels are packed into registers pack0 ... pack8; each packed pair is multiplied by its kernel weight (0.25, 0.5, 1, ...) and accumulated.]
2D Convolution
- 3x3 2D convolution with FP16
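A sketch of how those nine packed MADs can look with the CUDA half2 intrinsics (a minimal kernel I wrote for illustration, not the presenters' actual code; border handling and the data layout are simplified, and the nine kernel weights are assumed to arrive as floats; requires a GPU with native FP16, sm_53+):

```cuda
#include <cuda_fp16.h>

// Each thread computes TWO horizontally adjacent output pixels, kept
// packed in one half2 register. The nine taps of the 3x3 kernel then
// cost nine packed FP16 MADs (one __hfma2 per tap for two pixels).
__global__ void conv3x3_fp16(const __half* src, __half* dst,
                             int width, int height,
                             const float* k /* 9 kernel weights */)
{
    int x = (blockIdx.x * blockDim.x + threadIdx.x) * 2;  // 2 pixels/thread
    int y =  blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x + 1 >= width - 1 || y < 1 || y >= height - 1) return;

    half2 acc = __float2half2_rn(0.f);
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const __half* p = src + (y + dy) * width + (x + dx);
            // Pack the two neighbours needed by the two output pixels.
            half2 pix = __halves2half2(p[0], p[1]);
            half2 w   = __float2half2_rn(k[(dy + 1) * 3 + (dx + 1)]);
            acc = __hfma2(w, pix, acc);  // 1 packed FP16 MAD per tap
        }
    }
    dst[y * width + x]     = __low2half(acc);
    dst[y * width + x + 1] = __high2half(acc);
}
```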
2D Convolution
- Perf/Watt comparison
[Chart: normalized performance vs. normalized GPU power, comparing FP32 against FP16]
Case study #3: Sparse Lucas-Kanade Optical Flow (SparseLK)
SparseLK
Given the two frames I and I_prev, each iteration solves:

$$\Delta p = \left[\, \sum_{x \in N} \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix} \right]^{-1} \sum_{x \in N} \bigl( I(x) - I_{prev}(x + \Delta p_{prev}) \bigr) \begin{pmatrix} I_x \\ I_y \end{pmatrix}$$

[Figure: the displacement estimate is refined iteratively: Δp0, Δp1, Δp2]
SparseLK
- Solution #1
[Figure: threads T0, T1, ..., T5 cooperate on one feature point, evaluating the same Δp system shown above]
- Multiple threads per feature point
- Share data via shared memory or shuffle
- Reduction needed to get final results
- High thread level parallelism (TLP) but more instructions needed (see the sketch after this list)
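A sketch of what Solution #1 can look like (hypothetical kernel; the group size, the flattened Ix/Iy/It gradient layout, and the window parameter are my assumptions, not the presenters' code):

```cuda
// A group of threads cooperates on one feature point: each thread
// accumulates partial sums of the 2x2 normal-equation system, then a
// warp-shuffle butterfly reduction combines them. High TLP, but the
// shuffles and the reduction add instructions per feature.
__global__ void sparseLK_multiThread(const float* Ix, const float* Iy,
                                     const float* It,
                                     float2* dp, int window)
{
    const int GROUP = 8;                       // threads per feature point
    int lane    = threadIdx.x % GROUP;
    int feature = (blockIdx.x * blockDim.x + threadIdx.x) / GROUP;

    float gxx = 0, gxy = 0, gyy = 0, bx = 0, by = 0;
    for (int i = lane; i < window; i += GROUP) {  // strided over the window
        int idx = feature * window + i;
        float ix = Ix[idx], iy = Iy[idx], it = It[idx];
        gxx += ix * ix; gxy += ix * iy; gyy += iy * iy;
        bx  += it * ix; by  += it * iy;
    }

    // Butterfly reduction across the group via warp shuffles.
    for (int off = GROUP / 2; off > 0; off /= 2) {
        gxx += __shfl_down_sync(0xffffffff, gxx, off);
        gxy += __shfl_down_sync(0xffffffff, gxy, off);
        gyy += __shfl_down_sync(0xffffffff, gyy, off);
        bx  += __shfl_down_sync(0xffffffff, bx,  off);
        by  += __shfl_down_sync(0xffffffff, by,  off);
    }

    if (lane == 0) {                           // lane 0 solves the 2x2 system
        float det = gxx * gyy - gxy * gxy;
        dp[feature] = make_float2((gyy * bx - gxy * by) / det,
                                  (gxx * by - gxy * bx) / det);
    }
}
```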
SparseLK
- Solution #2
- Each thread handles one feature point
- No need to shuffle data
- No need to do a reduction
- Needs more registers to hold data
- High instruction level parallelism (ILP) but low occupancy (see the sketch after this list)
[Figure: threads T0, T1 each own a feature point]
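The same computation in the Solution #2 style, again as a hypothetical sketch under the same assumed data layout:

```cuda
// One thread owns one feature point and accumulates the whole window in
// registers: no shuffles, no reduction, fewer total instructions, but
// the extra registers per thread lower the achievable occupancy.
__global__ void sparseLK_singleThread(const float* Ix, const float* Iy,
                                      const float* It,
                                      float2* dp, int window)
{
    int feature = blockIdx.x * blockDim.x + threadIdx.x;

    float gxx = 0, gxy = 0, gyy = 0, bx = 0, by = 0;
    for (int i = 0; i < window; ++i) {         // whole window, one thread
        int idx = feature * window + i;
        float ix = Ix[idx], iy = Iy[idx], it = It[idx];
        gxx += ix * ix; gxy += ix * iy; gyy += iy * iy;
        bx  += it * ix; by  += it * iy;
    }

    float det = gxx * gyy - gxy * gxy;         // solve the 2x2 system directly
    dp[feature] = make_float2((gyy * bx - gxy * by) / det,
                              (gxx * by - gxy * bx) / det);
}
```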
SparseLK
- Instruction# and Perf/Watt
$$\mathrm{Perf/Watt} = \frac{\mathrm{Performance}}{\mathrm{Energy}}$$

$$\mathrm{Energy} = \sum_{i\,\in\,\text{instruction types}} \mathrm{EnergyPerInst}_i \times \mathrm{NumInst}_i \;+\; \mathrm{Power}_{\mathrm{leakage}} \times \mathrm{Time}$$
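A toy instantiation of the model (every number below is invented for illustration): assume each instruction costs about 1 nJ regardless of type, Solution #1 executes 1.5M instructions per frame against 1.0M for Solution #2, and leakage contributes 0.2 mJ per frame either way. Then

$$\mathrm{Energy}_{\#1} \approx 1.5\,\mathrm{mJ} + 0.2\,\mathrm{mJ} = 1.7\,\mathrm{mJ}, \qquad \mathrm{Energy}_{\#2} \approx 1.0\,\mathrm{mJ} + 0.2\,\mathrm{mJ} = 1.2\,\mathrm{mJ},$$

so at equal frame rate Solution #2 would deliver roughly 1.7/1.2 ≈ 1.4x better perf/watt, consistent with the comparison on the next slide.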
SparseLK
- Perf/Watt comparison
[Chart: normalized performance vs. normalized GPU power, comparing "multiple threads per feature" against "single thread per feature"]
Summary
- Analyze the whole pipeline at the system level
- Use energy efficient features on the target platform
- Balance between TLP and ILP