

SLIDE 1

A Case for Dynamic Activation Quantization in CNNs

Karl Taht, Surya Narayanan, Rajeev Balasubramonian, University of Utah

SLIDE 2

Overview

  • Background
  • Proposal
  • Search Space
  • Architecture
  • Results
  • Future Work
SLIDE 3

Improving CNN Efficiency

  • Stripes: Bit-Serial Deep Neural Network Computing
      • Per-layer bit precisions net significant savings with <1% accuracy loss
      • Brute-force approach to find the best quantization – retraining at each step!
      • Good end result, but expensive!
  • Weight-Entropy-Based Quantization for Deep Neural Networks
      • Quantizes both weights and activations
      • Guided search to find the optimal quantization (entropy and clustering)
      • Still requires retraining, still a passive approach

Can we exploit adaptive reduced precision during inference?

SLIDE 4

Proposal: Adaptive Quantization Approach (AQuA)

  • Most images contain regions of information irrelevant to the classification task
  • Can we avoid such computations altogether?
  • Quantize those regions completely, down to 0 bits
  • More simply – crop them! (a minimal sketch follows below)
SLIDE 5

Proposal: Activation Cropping

SLIDE 6

Proposal: Activation Cropping

Concept: add a lightweight predictor early in the network to save computations in the later layers. [Figure: network pipeline annotated with the predictor's insertion point and the layers where computation is saved]

SLIDE 7

Search Space – How to Crop

[Figure: image with an N-pixel crop border on each side]

  • Exploit domain knowledge
      • Information is typically centered within the image (>55% in our tests)
  • Utilize a regular pattern
      • Less control logic required
      • Maps more easily to different hardware
  • Added bonus: while objects are centered, the majority of the area (and thus computation) is on the outside!

SLIDE 8

Proposal: Activation Cropping

[Figure: example crops with N = 25, 10, 8, 5, and 2]

Concept: scale feature maps proportionally.
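One way to read "scale feature maps proportionally" is to shrink the crop border at the same rate the spatial resolution shrinks through the network. A sketch (the layer sizes and rounding rule are assumptions, not from the talk):

```python
def scaled_crops(n0, sizes):
    """Scale the crop border as feature maps shrink through the network.

    n0:    crop border (pixels per side) at the input resolution
    sizes: spatial size of each layer's (square) feature map
    """
    return [round(n0 * s / sizes[0]) for s in sizes]

# Purely illustrative, AlexNet-like feature-map sizes
print(scaled_crops(25, [227, 55, 27, 13, 13, 13]))  # [25, 6, 3, 1, 1, 1]
```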

SLIDE 9

Search Space – Crop Directions

[Figure: example crops encoded [0 1 0 1], [0 1 0 0], [0 0 1 0], [0 0 0 1], [1 0 0 0], and [1 0 1 1]]

  • We consider 16 possible crops as permutations of top, bottom, left, and right crops, encoded as a vector: [ TOP , BOTTOM , LEFT , RIGHT ]
  • Unlike traditional pruning, AQuA can exploit image-based information to enhance pruning options (see the sketch below)
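A sketch of the encoding (the helper function, crop size, and tensor shape are illustrative):

```python
from itertools import product

import numpy as np

def apply_crop(fmap, crop_vec, n):
    """Apply a directional crop encoded as [TOP, BOTTOM, LEFT, RIGHT].

    A 1 crops n pixels from that edge; a 0 leaves it intact.
    """
    top, bottom, left, right = crop_vec
    h, w = fmap.shape[-2:]
    return fmap[..., n * top : h - n * bottom, n * left : w - n * right]

# The 16 crop options are all 0/1 permutations of the four edges
crop_vectors = list(product([0, 1], repeat=4))
fmap = np.random.rand(64, 56, 56)
for vec in crop_vectors:
    print(vec, apply_crop(fmap, vec, n=5).shape)
```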

SLIDE 10

Quantifying Potentials

  • To maintain the original Top-1 accuracy, 75% of images can tolerate some type of crop!
  • Greater savings are possible with top-5 predictions
  • The technique is invariant to weight quantization

[Chart: number of edges cropped for each weight set]

SLIDE 11

Exploiting Energy Savings with ISAAC

  • The activation cropping technique can be applied to any architecture
  • We use the ISAAC accelerator due to its flexibility
  • Future work includes leveraging additional variable-precision techniques

[Figure: ISAAC crossbar; inputs feed the rows, weights are bit-sliced across columns of 1- and 2-bit cells, and outputs are read through an 8-bit ADC]

SLIDE 12

Weight Precision Savings

[Figure: weight bit-slices across crossbar columns, read through a multiplexed 8-bit ADC]

  • 16-bit weights span 8 columns – 8 ADC operations
  • 10-bit weights span only 5 columns – 5 ADC operations
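The column count, and hence the ADC work, follows directly from the weight precision and the bits stored per cell. A back-of-the-envelope check (2 bits per cell assumed, as in ISAAC):

```python
def adc_ops(weight_bits, bits_per_cell=2):
    """Columns (and thus ADC conversions per input cycle) per weight.

    A w-bit weight is bit-sliced across ceil(w / bits_per_cell) crossbar
    columns, and each column's analog sum needs one ADC conversion.
    """
    return -(-weight_bits // bits_per_cell)  # ceiling division

print(adc_ops(16))  # 8 ADC operations for a 16-bit weight
print(adc_ops(10))  # 5 ADC operations for a 10-bit weight
```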

SLIDE 13

“FlexPoint” Support

[Same crossbar figure as Slide 12]

The shift amount in the column reduction can be varied to compute fixed-point operations with different exponents.
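A sketch of how a variable shift in the column reduction realizes different fixed-point exponents (the function and layout are illustrative, not ISAAC's actual datapath):

```python
def shift_add_columns(column_sums, bits_per_cell=2, exponent=0):
    """Combine per-column ADC results into one fixed-point value.

    Column i holds bit-slice [i*bits_per_cell, (i+1)*bits_per_cell) of
    the weights, so its result is shifted by i*bits_per_cell before
    adding. The extra global `exponent` shift reinterprets the same
    column sums under a different fixed-point exponent.
    """
    total = 0
    for i, s in enumerate(column_sums):
        total += s << (i * bits_per_cell)
    return total << exponent

print(shift_add_columns([3, 1, 2]))              # 3 + (1<<2) + (2<<4) = 39
print(shift_add_columns([3, 1, 2], exponent=3))  # same sums, exponent 3: 312
```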

SLIDE 14

Activation Quantization Savings

[Figure: buffered k-bit inputs (e.g. 1...1010101) streamed into the crossbar one bit per time step, over time steps 1..k]

K-bit activations (inputs) require K time steps.

SLIDE 15

Activation Quantization Savings

[Same figure as Slide 14: buffered k-bit inputs streamed one bit per time step]

K-bit activations (inputs) require K time steps.

Fewer computations mean higher throughput, smaller area requirements, and lower energy.
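A minimal model of the bit-serial feed (a pure-Python stand-in for the analog datapath):

```python
def bit_serial_dot(inputs, weights, k):
    """Bit-serial dot product: k-bit inputs are fed one bit per time step.

    Each step t applies bit t of every input to the weight array and
    shift-adds the partial sum, so runtime is proportional to k; fewer
    activation bits directly remove time steps.
    """
    acc = 0
    for t in range(k):                            # one cycle per input bit
        bits = [(x >> t) & 1 for x in inputs]
        acc += sum(b * w for b, w in zip(bits, weights)) << t
    return acc

inputs, weights = [5, 3, 7], [2, 4, 1]            # 3-bit activations
assert bit_serial_dot(inputs, weights, k=3) == sum(x * w for x, w in zip(inputs, weights))
```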

SLIDE 16

Naive Approach – Crop Everything

  • Cropping everything yields substantial energy savings, at a cost to accuracy
  • Theoretically, over 33% energy can be saved while maintaining the original accuracy!
SLIDE 17

Overall Energy Savings

  • Adaptive quantization saves 33% energy on average compared to an uncropped baseline
  • The technique can be applied in conjunction with weight quantization techniques, with nearly identical relative savings

SLIDE 18

Future Work

  • Predict unimportant regions
      • Using a “0th” layer with just a few gradient-based kernels (sketched below)
  • Use variable low-precision computations for unimportant regions (not just cropping)
  • Quantify the energy and latency changes: an additional prediction step, but fewer overall computations

[Figure: original image and its Sobel gradient]
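A hypothetical sketch of such a predictor: Sobel gradient energy in each border strip as a cheap importance signal (the function, strip width, and thresholding policy are all assumptions, not from the talk):

```python
import numpy as np
from scipy.ndimage import sobel

def border_importance(image, n):
    """Mean Sobel gradient magnitude in each n-pixel border strip.

    A low-energy strip suggests that edge is a safe crop candidate;
    this plays the role of the hypothetical '0th layer' built from a
    few gradient kernels.
    """
    g = np.hypot(sobel(image, axis=0), sobel(image, axis=1))
    return {
        "top": g[:n].mean(), "bottom": g[-n:].mean(),
        "left": g[:, :n].mean(), "right": g[:, -n:].mean(),
    }

image = np.random.rand(224, 224).astype(np.float32)
print(border_importance(image, n=25))
```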

SLIDE 19

Conclusion

  • Adaptive quantization saves 33% energy on average compared to an uncropped baseline
  • The technique can be applied in conjunction with weight quantization techniques, with nearly identical relative savings

SLIDE 20

Thank you!

Questions?