SLIDE 1

POWER EFFICIENT VISUAL COMPUTING ON MOBILE PLATFORMS

BRANT ZHAO, NVIDIA | MAX LV, NVIDIA

SLIDE 2
  • Performance
  • Energy efficiency

SLIDE 3

Power Efficient GPU Programming

  • Case Studies & Findings
SLIDE 4

Case study #1: Image Pyramid Blending

SLIDE 5

Image Pyramid Blending

[Figure: two image pyramids blended into one image (reconstruct, up-sample, and add)]

SLIDE 6

Image Pyramid Blending

  • A naïve CUDA implementation

[Diagram: CPU/GPU frequency-vs-time trace. CPU and GPU work is interleaved: cudaMalloc for pyramids (CPU), create Laplacian pyramids for the left image (GPU), cudaMalloc (CPU), create Laplacian pyramids for the right image (GPU), cudaMalloc (CPU), create Gaussian pyramids for the mask image (GPU), cudaMalloc (CPU), blend Laplacian pyramids (GPU), reconstruct the blended image (GPU).]

SLIDE 7

Image Pyramid Blending

  • Power optimized: Avoid CPU<->GPU interleaving

[Diagram: same pipeline, but every cudaMalloc for the pyramids is hoisted into a single CPU phase up front; the GPU then runs pyramid creation for the left, right, and mask images, the blend, and the reconstruction back to back in one uninterrupted burst.]
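The restructuring above can be sketched in CUDA host code. The kernel name, level count, and buffer handling below are illustrative assumptions, not the talk's actual implementation:

```cuda
#include <cuda_runtime.h>

// Stand-in for the talk's pyramid-building kernels (assumed signature).
__global__ void buildLaplacianLevel(const float* src, float* dst,
                                    int w, int h);

// Naive flow: cudaMalloc (CPU) interleaved with kernel launches (GPU),
// so the two processors keep waking each other and neither can stay idle.
// Optimized flow below: all allocations first, then all launches, so the
// GPU runs one uninterrupted burst while the CPU can drop its frequency.
void buildPyramid(const float* d_src, int w, int h, int num_levels) {
    float* d_levels[16];

    // Phase 1: every cudaMalloc up front (pure CPU phase).
    for (int i = 0; i < num_levels; ++i) {
        int lw = w >> i, lh = h >> i;
        cudaMalloc(&d_levels[i], (size_t)lw * lh * sizeof(float));
    }

    // Phase 2: all kernel launches back to back (pure GPU phase).
    for (int i = 0; i < num_levels; ++i) {
        int lw = w >> i, lh = h >> i;
        dim3 block(16, 16), grid((lw + 15) / 16, (lh + 15) / 16);
        buildLaplacianLevel<<<grid, block>>>(
            i ? d_levels[i - 1] : d_src, d_levels[i], lw, lh);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < num_levels; ++i) cudaFree(d_levels[i]);
}
```

The same pattern applies across the whole blend: allocate for all four pyramids before launching any kernel, rather than alternating cudaMalloc and launches per pyramid.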

SLIDE 8

Image Pyramid Blending

  • Perf/Watt comparison

[Chart: normalized performance (y) vs. normalized CPU+GPU power (x), comparing "CPU/GPU interleaving" against "not interleaving".]

SLIDE 9

Case study #2: 2D Convolution

SLIDE 10

2D Convolution

[Figure: a sample input pixel row (1 2 1 2 2 3 2 1 10 1) convolved with fractional filter weights (0.25, 0.75 shown).]

SLIDE 11

2D Convolution

[Figure: the same convolution with an output value (8) now computed.]

SLIDE 12

2D Convolution

  • 3x3 2D convolution with FP16
  • Basic operations for 2 output pixels: 9 packed FP16 MADs (pack0 through pack8)

[Figure: the operands for two adjacent output pixels packed pairwise into nine registers, pack0..pack8.]
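A minimal sketch of the packed-FP16 idea, assuming a simple row-major layout and illustrative names (`in`, `out`, `coef`); the talk's real kernel is not shown, so this is only an instance of the technique:

```cuda
#include <cuda_fp16.h>

// Two horizontally adjacent output pixels are computed together: each
// half2 packs the pair of input pixels the two outputs need for one
// filter tap, so the 3x3 convolution is 9 packed FP16 multiply-adds.
__global__ void conv3x3_fp16x2(const __half* in, __half* out,
                               int w, int h, const float* coef) {
    int x = (blockIdx.x * blockDim.x + threadIdx.x) * 2;  // 2 pixels/thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x + 1 >= w - 1 || y < 1 || y >= h - 1) return;

    half2 acc = __float2half2_rn(0.f);
    #pragma unroll
    for (int ky = -1; ky <= 1; ++ky) {
        #pragma unroll
        for (int kx = -1; kx <= 1; ++kx) {
            // pack: tap for output x in .x, tap for output x+1 in .y
            half2 pix = __halves2half2(in[(y + ky) * w + (x + kx)],
                                       in[(y + ky) * w + (x + 1 + kx)]);
            half2 c = __float2half2_rn(coef[(ky + 1) * 3 + (kx + 1)]);
            acc = __hfma2(c, pix, acc);  // one packed FP16 MAD per tap
        }
    }
    out[y * w + x]     = __low2half(acc);
    out[y * w + x + 1] = __high2half(acc);
}
```

Halving both the instruction count and the bytes moved per output pixel is what drives the FP16 perf/watt gain on the next slide's chart.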
SLIDE 13

2D Convolution

  • Perf/Watt comparison

[Chart: normalized performance (y) vs. normalized GPU power (x), FP32 vs. FP16.]

SLIDE 14

Case study #3: Sparse Lucas-Kanade Optical Flow (SparseLK)

SLIDE 15

SparseLK

First frame $I$, second frame $I_{\text{next}}$. Each iteration solves

$$\Delta p = \left[\sum_{x\in\Omega}\begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}\right]^{-1} \sum_{x\in\Omega}\bigl(I(x)-I_{\text{next}}(x+\Delta p_{\text{prev}})\bigr)\begin{pmatrix} I_x \\ I_y \end{pmatrix}$$

over the tracking window $\Omega$, yielding successive updates $\Delta p_0, \Delta p_1, \Delta p_2, \ldots$

SLIDE 16

SparseLK

  • Solution #1: multiple threads (T0, T1, ..., T5) per feature point
  • Share data via shared memory or shuffle
  • Reduction needed to get the final result
  • High thread-level parallelism (TLP), but more instructions needed

SLIDE 17

SparseLK

  • Solution #2: each thread (T0, T1, ...) handles one feature point
  • No need to shuffle data
  • No need to do a reduction
  • Needs more registers to hold data
  • High instruction-level parallelism (ILP), but low occupancy
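Solution #2 can be sketched as follows; the gradient and difference buffers, the window size `WIN`, and all names are assumptions for illustration, not the talk's code:

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define WIN 7  // assumed tracking-window size

// One thread owns one feature point and accumulates the 2x2 normal
// matrix and right-hand side entirely in registers: no shuffle, no
// reduction, at the cost of higher register pressure (lower occupancy).
__global__ void lkSolveStep(const float* gradX, const float* gradY,
                            const float* diff, int stride,
                            const int2* features, float2* delta, int n) {
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= n) return;
    int2 p = features[f];

    // Per-thread accumulators in registers: TLP traded for ILP.
    float a11 = 0.f, a12 = 0.f, a22 = 0.f, b1 = 0.f, b2 = 0.f;
    for (int y = 0; y < WIN; ++y)
        for (int x = 0; x < WIN; ++x) {
            int idx = (p.y + y) * stride + (p.x + x);
            float ix = gradX[idx], iy = gradY[idx], d = diff[idx];
            a11 += ix * ix; a12 += ix * iy; a22 += iy * iy;
            b1  += d * ix;  b2  += d * iy;
        }

    // Invert the 2x2 system: delta = A^-1 * b.
    float det = a11 * a22 - a12 * a12;
    float inv = (fabsf(det) > 1e-7f) ? 1.f / det : 0.f;
    delta[f] = make_float2(( a22 * b1 - a12 * b2) * inv,
                           (-a12 * b1 + a11 * b2) * inv);
}
```

Solution #1 would instead split the window loop across several threads and combine the partial sums with shuffles and a reduction, which is the extra instruction cost the next slide's energy model accounts for.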

SLIDE 18

SparseLK

  • Instruction# and Perf/Watt

$$\mathrm{Perf/Watt}=\frac{\mathrm{Workload}/\mathrm{Sec}}{\mathrm{Energy}/\mathrm{Sec}}=\frac{\mathrm{Workload}}{\mathrm{Energy}}$$

$$\mathrm{Energy}=E_{\text{shuffle}}\,N_{\text{shuffle}}+E_{\text{reduction}}\,N_{\text{reduction}}+E_{\text{other}}\,N_{\text{other}}+P_{\text{wasted}}\cdot\mathrm{Time}$$

where $E_i$ is the energy per instruction of type $i$ and $N_i$ its instruction count. (The slide also shows the same sum without the $P_{\text{wasted}}\cdot\mathrm{Time}$ term.)

SLIDE 19

SparseLK

  • Perf/Watt comparison

[Chart: normalized performance (y) vs. normalized GPU power (x), comparing "multi-thread per feature" against "single thread per feature".]

SLIDE 20

Summary

  • Analyze the whole pipeline at the system level
  • Use energy-efficient features on the target platform
  • Balance between TLP and ILP
SLIDE 21

THANK YOU

brantz@nvidia.com