Convolutional Neural Networks (CNNs) / Recurrent Neural Networks (RNNs) - PowerPoint PPT Presentation




SLIDE 1
SLIDE 2

Convolutional Neural Networks (CNNs) Recurrent Neural Networks (RNNs)

SLIDE 3
SLIDE 4
SLIDE 5

[Diagram: Project Brainwave — a pretrained DNN model is mapped onto scalable DNN hardware exposed as a microservice; the Neural Processing Unit pairs a Scalar Processor with an M*V (matrix-vector) processor and L0/L1 neural function units (N).]

SLIDE 6

• Extract parallelism from a single thread of execution
• Achieve high utilization without batching
• Scale to O(100k) spatial units
• Synthesis specialization

SLIDE 7

Serial dependence
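The serial dependence the slide names is the core obstacle for RNNs on parallel hardware: each hidden state depends on the previous one, so timesteps cannot run concurrently and parallelism has to come from inside a single step. A minimal NumPy sketch of that dependency (illustrative names, not the deck's code):

```python
import numpy as np

def rnn_unroll(x_seq, W, U, b, h0):
    """Run a vanilla RNN over a sequence.

    The loop is inherently serial: h[t] cannot be computed until
    h[t-1] exists, so extra parallelism must be extracted from
    inside each step (the matrix-vector products), not from
    running timesteps concurrently.
    """
    h = h0
    outputs = []
    for x_t in x_seq:                      # serial over time
        h = np.tanh(W @ x_t + U @ h + b)   # parallel within a step
        outputs.append(h)
    return outputs
```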

SLIDE 8

• Extract parallelism from a single thread of execution
• Achieve high utilization without batching
• Scale to O(100k) spatial units
• Synthesis specialization

SLIDE 9

[Chart: Batched RNNs — O(0) vs. O(N²)]

SLIDE 10

[Chart: Batched RNNs — O(0), O(N), O(N²)]
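The O(N) growth reflects why batching is the conventional fix: each weight fetched from memory serves every input in the batch, so arithmetic intensity rises linearly with batch size, while batch=1 is stuck at a fixed ops-per-byte ratio. A back-of-envelope model (my own sketch, not Brainwave's analysis):

```python
def arithmetic_intensity(n, batch, bytes_per_weight=2):
    """Ops per weight byte for an n x n matrix-vector multiply at a
    given batch size. Illustrative roofline-style model: weights are
    fetched once and reused across the whole batch."""
    ops = 2 * n * n * batch                   # one multiply-add per weight per input
    weight_bytes = n * n * bytes_per_weight   # weights read once per batch
    return ops / weight_bytes

# At batch=1 intensity is a small constant (memory-bound);
# it grows O(batch) as weight fetches are amortized.
```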

SLIDE 11

• Extract parallelism from a single thread of execution
• Achieve high utilization without batching
• Scale to O(100k) spatial units
• Synthesis specialization

SLIDE 12
SLIDE 13
SLIDE 14

void LSTM(int steps) {
  for (int t = 0; t < steps; t++) {
    v_rd(NetQ);                 // read x_t from the network queue
    v_wr(InitialVrf, xt);
    v_rd(InitialVrf, xt);       // forget gate: sigm(xt*Wf + h_prev*Uf + bf)
    mv_mul(Wf);
    vv_add(bf);
    v_wr(AddSubVrf, xWf);
    v_rd(InitialVrf, h_prev);
    mv_mul(Uf);
    vv_add(xWf);
    v_sigm();
    vv_mul(c_prev);             // f_t * c_{t-1}
    v_wr(AddSubVrf, ft_mod);
    v_rd(InitialVrf, h_prev);   // candidate cell: tanh(xWc + h_prev*Uc)
    mv_mul(Uc);                 // (xWc is produced earlier, not shown in this excerpt)
    vv_add(xWc);
    v_tanh();
    vv_mul(it);                 // i_t * c~_t
    vv_add(ft_mod);             // c_t = f_t*c_{t-1} + i_t*c~_t
    v_wr(MultiplyVrf, c_prev);
    v_wr(InitialVrf, ct);
    v_rd(InitialVrf, ct);       // h_t = o_t * tanh(c_t)
    v_tanh();
    vv_mul(ot);
    v_wr(InitialVrf, h_prev);   // h_t becomes h_prev for the next step
    v_wr(NetQ);                 // emit h_t to the network queue
  }
}
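The instruction stream above is the NPU encoding of a standard LSTM cell. For reference, the same math in plain NumPy, with the usual LSTM parameter names (Wf/Uf/bf etc. assumed; the NPU's mv_mul, vv_add, v_sigm and v_tanh ops correspond to the @, +, sigmoid and tanh below):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM timestep in NumPy, mirroring the math the NPU
    instruction stream implements. p holds the weight matrices
    and biases; names are the conventional ones, assumed here."""
    f = sigmoid(p['Wf'] @ x_t + p['Uf'] @ h_prev + p['bf'])   # forget gate
    i = sigmoid(p['Wi'] @ x_t + p['Ui'] @ h_prev + p['bi'])   # input gate
    o = sigmoid(p['Wo'] @ x_t + p['Uo'] @ h_prev + p['bo'])   # output gate
    c_tilde = np.tanh(p['Wc'] @ x_t + p['Uc'] @ h_prev + p['bc'])
    c = f * c_prev + i * c_tilde    # the vv_mul / vv_add cell-state update
    h = o * np.tanh(c)              # written back as h_prev for the next step
    return h, c
```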
SLIDE 15

• Extract parallelism from a single thread of execution
• Achieve high utilization without batching
• Scale to O(100k) spatial units
• Synthesis specialization

SLIDE 16
SLIDE 17
SLIDE 18

[Diagram: a single multiply-add (+, ×) unit]

SLIDE 19

[Diagram: multiply-add (+/×) reduction tree]

SLIDE 20

[Diagram: array of multiply-add (+/×) reduction trees]
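The +/× patterns depict dot products built as a multiply stage feeding a balanced adder tree: n products are reduced in log2(n) addition levels rather than a length-n serial chain, which is what lets the spatial array keep its units busy. A small sketch of that reduction shape (illustrative, not the hardware's actual datapath):

```python
def tree_dot(a, b):
    """Dot product as one multiply level followed by a balanced
    adder tree -- log2(n) addition levels instead of a serial sum,
    matching the +/x trees in the diagram."""
    partial = [x * y for x, y in zip(a, b)]   # the x (multiply) level
    while len(partial) > 1:                   # the + (adder) levels
        if len(partial) % 2:                  # pad odd widths with zero
            partial.append(0)
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]
```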

SLIDE 21
SLIDE 22


SLIDE 23
SLIDE 24
SLIDE 25

[Diagram: Dispatcher feeding parallel units (D)]

SLIDE 26

[Diagram: instruction dispatch — Scalar Processor, Top Level Scheduler, MVM Scheduler]

Instructions
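One way to read the scheduler hierarchy: instructions issued by the scalar processor are fanned out by a top-level scheduler as micro-ops to the queues of the distributed matrix-vector units. A toy model of that fan-out (purely illustrative; not Brainwave's actual scheduler logic, and the function and names here are my own):

```python
from collections import deque

def dispatch(instructions, num_tiles):
    """Toy hierarchical dispatch: each instruction from the scalar
    processor is broadcast by the top-level scheduler to every
    MVM tile's local queue, where it executes independently."""
    tile_queues = [deque() for _ in range(num_tiles)]
    for op in instructions:         # stream from the scalar processor
        for q in tile_queues:       # fan out to every MVM tile
            q.append(op)
    return tile_queues
```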

SLIDE 27

• Extract parallelism from a single thread of execution
• Achieve high utilization without batching
• Scale to O(100k) spatial units
• Synthesis specialization

SLIDE 28

Device                     Node        Latency   Effective TFLOPS   Utilization
BW-NPU (Stratix 10, FP8)   Intel 14nm  2 ms      35.9               74.8%
BW-NPU (Arria 10, FP11)    TSMC 20nm   1.64 ms   4.7                66%

DeepBench
• RNN: GRU-2816, batch=1, 71B OPs/serve
• CNN: ResNet-50, batch=1, 7.7B OPs/serve
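"Effective TFLOPS" here is simply work per request divided by latency. A quick sanity check, assuming the GRU-2816 run maps to the Stratix 10 row and ResNet-50 to the Arria 10 row (that pairing is my inference from the arithmetic, not stated on the slide):

```python
def effective_tflops(ops_per_serve, latency_s):
    """Effective throughput implied by per-request work and latency."""
    return ops_per_serve / latency_s / 1e12

# GRU-2816:  ~71B OPs in 2 ms    -> ~35.5 TFLOPS (close to the 35.9
#            reported; "71B" is presumably rounded)
# ResNet-50: 7.7B OPs in 1.64 ms -> ~4.7 TFLOPS
```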

SLIDE 29
SLIDE 30

https://github.com/Azure/aml-real-time-ai