POWER EFFICIENT VISUAL COMPUTING ON MOBILE PLATFORMS
BRANT ZHAO, NVIDIA | MAX LV, NVIDIA

- Performance & Energy Efficiency
- Power Efficient GPU Programming: Case Studies & Findings
Case study #1: Image Pyramid Blending
[Figure: two input images combined into one blended result]
Reconstruct, Up-sample and Add
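For reference, the blend the figure depicts is the standard Laplacian pyramid formulation (my notation, not from the slides): each Laplacian level of the two images is mixed by the corresponding Gaussian level of the mask, and the result is reconstructed coarse-to-fine by up-sampling and adding:

$$L_{blend}^{k} = G_{mask}^{k} \cdot L_{left}^{k} + \left(1 - G_{mask}^{k}\right) \cdot L_{right}^{k}, \qquad I^{k} = \mathrm{upsample}\!\left(I^{k+1}\right) + L_{blend}^{k}$$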
Image Pyramid Blending
- A naïve CUDA implementation
[Pipeline diagram: CPU/GPU activity and clock frequency over time]
CPU: cudaMalloc for pyramids → GPU: Create Laplacian Pyramids for left image
CPU: cudaMalloc for pyramids → GPU: Create Laplacian Pyramids for right image
CPU: cudaMalloc for pyramids → GPU: Create Gaussian Pyramids for mask image
CPU: cudaMalloc for pyramids → GPU: Blend Laplacian Pyramids
GPU: Reconstruct Blended Image
The work ping-pongs between CPU and GPU, so neither processor's clock can settle at an efficient operating point.
Image Pyramid Blending
- Power optimized: Avoid CPU<->GPU interleaving
[Pipeline diagram: all cudaMalloc calls batched before any GPU work]
CPU: cudaMalloc for all pyramids (left, right, mask, blend)
GPU: Create Laplacian Pyramids for left image → Create Laplacian Pyramids for right image → Create Gaussian Pyramids for mask image → Blend Laplacian Pyramids → Reconstruct Blended Image
One CPU burst followed by one uninterrupted GPU burst, as sketched below.
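The restructuring boils down to hoisting every cudaMalloc ahead of the first kernel launch. A minimal host-side sketch of that pattern (the pyramid-building helpers, LEVELS, and levelBytes are hypothetical stand-ins I introduce for illustration, not the presenters' API):

```cuda
#include <cuda_runtime.h>

// Hypothetical helpers; names are illustrative only.
size_t levelBytes(int level);
void buildLaplacianPyramid(float** dst, const float* srcImage);
void buildGaussianPyramid(float** dst, const float* srcImage);
void blendPyramids(float** dst, float** left, float** right, float** mask);
void reconstruct(float* dstImage, float** blend);

void blendImages(const float* left, const float* right, const float* mask,
                 float* out)
{
    const int LEVELS = 5;  // assumed pyramid depth
    float *dLeft[LEVELS], *dRight[LEVELS], *dMask[LEVELS], *dBlend[LEVELS];

    // One CPU burst: every allocation happens before any GPU work,
    // instead of a cudaMalloc before each pyramid as in the naive version.
    for (int l = 0; l < LEVELS; ++l) {
        size_t bytes = levelBytes(l);
        cudaMalloc(&dLeft[l],  bytes);
        cudaMalloc(&dRight[l], bytes);
        cudaMalloc(&dMask[l],  bytes);
        cudaMalloc(&dBlend[l], bytes);
    }

    // One GPU burst: kernels launch back to back with no CPU work between,
    // so neither processor keeps ramping its frequency up and down.
    buildLaplacianPyramid(dLeft,  left);
    buildLaplacianPyramid(dRight, right);
    buildGaussianPyramid(dMask,   mask);
    blendPyramids(dBlend, dLeft, dRight, dMask);
    reconstruct(out, dBlend);
}
```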
Image Pyramid Blending
- Perf/Watt comparison
[Chart: normalized performance vs. normalized CPU+GPU power, comparing "CPU GPU interleaving" against "NOT interleaving"]
Case study #2: 2D Convolution
[Figure: a simple convolution example. Input pixels 1 2 1 2 2 3 2 1 10 1 are combined with kernel weights 0.25 and 0.75; the weighted sum of neighbouring pixels produces output pixel 8.]
2D Convolution
- Basic operations for 2 output pixels
- 9 packed FP16 MAD
[Figure: for two adjacent output pixels, the nine pairs of input pixels are packed into registers pack0 ... pack8; each packed pair is multiplied by its kernel weight (0.25, 0.5, 1, ...) and accumulated.]
2D Convolution
- 3x3 2D convolution with FP16
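A sketch of how those nine packed MADs can look with the CUDA half2 intrinsics (a minimal kernel I wrote for illustration, not the presenters' actual code; border handling and the data layout are simplified, and the nine kernel weights are assumed to arrive as floats; requires a GPU with native FP16, sm_53+):

```cuda
#include <cuda_fp16.h>

// Each thread computes TWO horizontally adjacent output pixels, kept
// packed in one half2 register. The nine taps of the 3x3 kernel then
// cost nine packed FP16 MADs (one __hfma2 per tap for two pixels).
__global__ void conv3x3_fp16(const __half* src, __half* dst,
                             int width, int height,
                             const float* k /* 9 kernel weights */)
{
    int x = (blockIdx.x * blockDim.x + threadIdx.x) * 2;  // 2 pixels/thread
    int y =  blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x + 1 >= width - 1 || y < 1 || y >= height - 1) return;

    half2 acc = __float2half2_rn(0.f);
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const __half* p = src + (y + dy) * width + (x + dx);
            // Pack the two neighbours needed by the two output pixels.
            half2 pix = __halves2half2(p[0], p[1]);
            half2 w   = __float2half2_rn(k[(dy + 1) * 3 + (dx + 1)]);
            acc = __hfma2(w, pix, acc);  // 1 packed FP16 MAD per tap
        }
    }
    dst[y * width + x]     = __low2half(acc);
    dst[y * width + x + 1] = __high2half(acc);
}
```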
2D Convolution
- Perf/Watt comparison
[Chart: normalized performance vs. normalized GPU power, comparing FP32 against FP16]
Case study #3: Sparse Lucas-Kanade Optical Flow (SparseLK)
SparseLK
Given the two frames I and I_prev, each iteration solves:

$$\Delta p = \left[\, \sum_{x \in N} \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix} \right]^{-1} \sum_{x \in N} \bigl( I(x) - I_{prev}(x + \Delta p_{prev}) \bigr) \begin{pmatrix} I_x \\ I_y \end{pmatrix}$$

[Figure: the displacement estimate is refined iteratively: Δp0, Δp1, Δp2]
SparseLK
- Solution #1
[Figure: threads T0, T1, ..., T5 cooperate on one feature point, evaluating the same Δp system shown above]
- Multiple threads per feature point
- Share data via shared memory or shuffle
- Reduction needed to get final results
- High thread level parallelism (TLP) but more instructions needed (see the sketch after this list)
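A sketch of what Solution #1 can look like (hypothetical kernel; the group size, the flattened Ix/Iy/It gradient layout, and the window parameter are my assumptions, not the presenters' code):

```cuda
// A group of threads cooperates on one feature point: each thread
// accumulates partial sums of the 2x2 normal-equation system, then a
// warp-shuffle butterfly reduction combines them. High TLP, but the
// shuffles and the reduction add instructions per feature.
__global__ void sparseLK_multiThread(const float* Ix, const float* Iy,
                                     const float* It,
                                     float2* dp, int window)
{
    const int GROUP = 8;                       // threads per feature point
    int lane    = threadIdx.x % GROUP;
    int feature = (blockIdx.x * blockDim.x + threadIdx.x) / GROUP;

    float gxx = 0, gxy = 0, gyy = 0, bx = 0, by = 0;
    for (int i = lane; i < window; i += GROUP) {  // strided over the window
        int idx = feature * window + i;
        float ix = Ix[idx], iy = Iy[idx], it = It[idx];
        gxx += ix * ix; gxy += ix * iy; gyy += iy * iy;
        bx  += it * ix; by  += it * iy;
    }

    // Butterfly reduction across the group via warp shuffles.
    for (int off = GROUP / 2; off > 0; off /= 2) {
        gxx += __shfl_down_sync(0xffffffff, gxx, off);
        gxy += __shfl_down_sync(0xffffffff, gxy, off);
        gyy += __shfl_down_sync(0xffffffff, gyy, off);
        bx  += __shfl_down_sync(0xffffffff, bx,  off);
        by  += __shfl_down_sync(0xffffffff, by,  off);
    }

    if (lane == 0) {                           // lane 0 solves the 2x2 system
        float det = gxx * gyy - gxy * gxy;
        dp[feature] = make_float2((gyy * bx - gxy * by) / det,
                                  (gxx * by - gxy * bx) / det);
    }
}
```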
SparseLK
- Solution #2
- Each thread handles one feature point
- No need to shuffle data
- No need to do a reduction
- Needs more registers to hold data
- High instruction level parallelism (ILP) but low occupancy (see the sketch after this list)
[Figure: threads T0, T1 each own a feature point]
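The same computation in the Solution #2 style, again as a hypothetical sketch under the same assumed data layout:

```cuda
// One thread owns one feature point and accumulates the whole window in
// registers: no shuffles, no reduction, fewer total instructions, but
// the extra registers per thread lower the achievable occupancy.
__global__ void sparseLK_singleThread(const float* Ix, const float* Iy,
                                      const float* It,
                                      float2* dp, int window)
{
    int feature = blockIdx.x * blockDim.x + threadIdx.x;

    float gxx = 0, gxy = 0, gyy = 0, bx = 0, by = 0;
    for (int i = 0; i < window; ++i) {         // whole window, one thread
        int idx = feature * window + i;
        float ix = Ix[idx], iy = Iy[idx], it = It[idx];
        gxx += ix * ix; gxy += ix * iy; gyy += iy * iy;
        bx  += it * ix; by  += it * iy;
    }

    float det = gxx * gyy - gxy * gxy;         // solve the 2x2 system directly
    dp[feature] = make_float2((gyy * bx - gxy * by) / det,
                              (gxx * by - gxy * bx) / det);
}
```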
SparseLK
- Instruction# and Perf/Watt
$$\mathrm{Perf/Watt} = \frac{\mathrm{Performance}}{\mathrm{Energy}}$$

$$\mathrm{Energy} = \sum_{i\,\in\,\text{instruction types}} \mathrm{EnergyPerInst}_i \times \mathrm{NumInst}_i \;+\; \mathrm{Power}_{\mathrm{leakage}} \times \mathrm{Time}$$
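A toy instantiation of the model (every number below is invented for illustration): assume each instruction costs about 1 nJ regardless of type, Solution #1 executes 1.5M instructions per frame against 1.0M for Solution #2, and leakage contributes 0.2 mJ per frame either way. Then

$$\mathrm{Energy}_{\#1} \approx 1.5\,\mathrm{mJ} + 0.2\,\mathrm{mJ} = 1.7\,\mathrm{mJ}, \qquad \mathrm{Energy}_{\#2} \approx 1.0\,\mathrm{mJ} + 0.2\,\mathrm{mJ} = 1.2\,\mathrm{mJ},$$

so at equal frame rate Solution #2 would deliver roughly 1.7/1.2 ≈ 1.4x better perf/watt, consistent with the comparison on the next slide.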
SparseLK
- Perf/Watt comparison
[Chart: normalized performance vs. normalized GPU power, comparing "multiple threads per feature" against "single thread per feature"]
Summary
- Analyze the whole pipeline at the system level
- Use energy efficient features on the target platform
- Balance between TLP and ILP