SLIDE 1
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)
[Title slide: Project Brainwave — a Scalable DNN Hardware Microservice. A Neural Processing Unit (Scalar Processor + M*V Processor) serves pretrained DNN models (CNNs and RNNs) behind L0/L1 network switches.]
SLIDE 2
SLIDE 3
SLIDE 4
SLIDE 5
[Diagram: Project Brainwave — Scalable DNN Hardware Microservice. A pretrained DNN model is deployed onto a Neural Processing Unit (Scalar Processor + M*V Processor), with NPU nodes (N) connected through L0 and L1 network switches.]
SLIDE 6
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 7
Serial dependence
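The serial dependence is the core RNN serving problem: each hidden state is a function of the previous one, so time steps cannot run in parallel. A minimal sketch (hypothetical NumPy implementation, toy dimensions):

```python
import numpy as np

def rnn_forward(x_seq, W, U, b, h0):
    """Vanilla RNN: h_t depends on h_{t-1}, so the time loop is
    inherently serial -- step t cannot start before step t-1 finishes."""
    h = h0
    outputs = []
    for x_t in x_seq:  # serial dependence: iteration order matters
        h = np.tanh(W @ x_t + U @ h + b)
        outputs.append(h)
    return outputs

# toy dimensions (illustrative): 4 time steps, input dim 3, hidden dim 2
rng = np.random.default_rng(0)
x_seq = [rng.standard_normal(3) for _ in range(4)]
W = rng.standard_normal((2, 3))
U = rng.standard_normal((2, 2))
outs = rnn_forward(x_seq, W, U, np.zeros(2), np.zeros(2))
```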
SLIDE 8
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 9
[Diagram: Batched RNNs — O(N²) weight matrix]
SLIDE 10
[Diagram: Batched RNNs — O(N²) weights vs. O(N) activations per request]
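A hedged sketch of why servers normally batch RNN requests: for an N×N layer, the O(N²) weight matrix must be read once per pass regardless of batch size, while each extra request adds compute but no extra weight traffic, so arithmetic intensity grows linearly with batch (function name is illustrative):

```python
def arithmetic_intensity(n, batch):
    """Ops per weight read for an n x n layer at a given batch size.
    Batching amortizes the O(n^2) weight fetch across requests --
    the cost Brainwave aims to avoid paying at batch=1."""
    ops = 2 * n * n * batch  # one multiply + one add per weight, per request
    weights_read = n * n     # the N^2 weight matrix is fetched once per batch
    return ops / weights_read

# intensity is 2*batch, independent of the layer size n
assert arithmetic_intensity(1000, 8) == 8 * arithmetic_intensity(1000, 1)
```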
SLIDE 11
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 12
SLIDE 13
SLIDE 14
void LSTM(int steps) {
  for (int t = 0; t < steps; t++) {
    v_rd(NetQ);                // read input vector xt from the network queue
    v_wr(InitialVrf, xt);
    v_rd(InitialVrf, xt);
    mv_mul(Wf);                // xt * Wf
    vv_add(bf);                // + bf
    v_wr(AddSubVrf, xWf);
    v_rd(InitialVrf, h_prev);
    mv_mul(Uf);                // h_prev * Uf
    vv_add(xWf);               // + (xt*Wf + bf)
    v_sigm();                  // forget gate ft
    vv_mul(c_prev);            // ft * c_prev (elementwise)
    v_wr(AddSubVrf, ft_mod);
    v_rd(InitialVrf, h_prev);
    mv_mul(Uc);                // h_prev * Uc
    vv_add(xWc);               // + (xt*Wc + bc)
    v_tanh();                  // candidate cell state
    vv_mul(it);                // * input gate it (elementwise)
    vv_add(ft_mod);            // ct = ft*c_prev + it*candidate
    v_wr(MultiplyVrf, c_prev);
    v_wr(InitialVrf, ct);
    v_rd(InitialVrf, ct);
    v_tanh();                  // tanh(ct)
    vv_mul(ot);                // * output gate ot (elementwise)
    v_wr(InitialVrf, h_prev);  // ht becomes h_prev for step t+1
    v_wr(NetQ);                // emit ht to the network queue
  }
}
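As a reference for what that instruction chain computes, here is a hedged NumPy sketch of one LSTM step. The names (`lstm_step`, the parameter dict) are illustrative; the gate equations are the standard LSTM formulation that the mv_mul/vv_add/v_sigm/v_tanh sequence implements:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """Reference math for one LSTM step: each gate is an mv_mul + vv_add
    chain followed by v_sigm or v_tanh on the NPU."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])        # input gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])        # output gate
    c_t = f * c_prev + i * c_tilde  # the ft_mod + it*tanh(...) accumulation
    h_t = o * np.tanh(c_t)          # vv_mul(ot) after v_tanh(ct)
    return h_t, c_t

# toy check (illustrative sizes): all-zero weights give gates of 0.5
p = {k: np.zeros((2, 2)) for k in ("Wf", "Uf", "Wi", "Ui", "Wc", "Uc", "Wo", "Uo")}
p.update({k: np.zeros(2) for k in ("bf", "bi", "bc", "bo")})
h_t, c_t = lstm_step(np.zeros(2), np.zeros(2), np.ones(2), p)
```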
SLIDE 15
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 16
SLIDE 17
SLIDE 18
[Diagram: a single multiply (×) feeding an add (+)]
SLIDE 19
[Diagram: multipliers (×) feeding a binary adder tree (+) — a dot-product unit]
SLIDE 20
[Diagram: dot-product units (multiplier rows plus adder trees) tiled into a larger matrix-vector array]
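The grids of × and + symbols depict a row of multipliers feeding a log-depth binary adder tree, i.e. a dot-product unit. A small sketch of that reduction (function name is illustrative):

```python
def dot_product_tree(a, b):
    """Dot product the way the hardware diagram suggests: one multiplier
    per element pair, then a binary adder tree reducing partial sums,
    one tree level per loop iteration."""
    level = [x * y for x, y in zip(a, b)]  # the row of multipliers
    while len(level) > 1:
        if len(level) % 2:                 # pad odd-width levels with 0
            level = level + [0]
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# 8-wide unit: log2(8) = 3 adder-tree levels
assert dot_product_tree([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8) == 36
```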
SLIDE 21
SLIDE 22
SLIDE 23
SLIDE 24
SLIDE 25
[Diagram: Dispatcher fanning out to multiple units (D)]
SLIDE 26
[Diagram: Top Level Scheduler issuing instructions to the MVM Scheduler and the Scalar Processor]
SLIDE 27
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 28
Device                     Node        Latency   Effective TFLOPS   Utilization
BW-NPU (Stratix 10, FP8)   Intel 14nm  2 ms      35.9               74.8%
BW-NPU (Arria 10, FP11)    TSMC 20nm   1.64 ms   4.7                66%
Benchmarks:
DeepBench RNN: GRU-2816, batch=1, 71B OPs/serve
CNN: ResNet-50, batch=1, 7.7B OPs/serve
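The Effective TFLOPS figures are consistent with OPs/serve divided by latency. A quick check with the slide's numbers (the rounded 2 ms latency yields ~35.5 against the reported 35.9; the helper name is illustrative):

```python
def effective_tflops(ops_per_serve, latency_s):
    """Effective throughput implied by serving one request in latency_s."""
    return ops_per_serve / latency_s / 1e12

# GRU-2816, batch=1: 71B OPs served in ~2 ms (Stratix 10 row)
gru = effective_tflops(71e9, 2e-3)        # ~35.5 TFLOPS
# ResNet-50, batch=1: 7.7B OPs served in 1.64 ms (Arria 10 row)
resnet = effective_tflops(7.7e9, 1.64e-3)  # ~4.7 TFLOPS
```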
SLIDE 29
SLIDE 30