SLIDE 1
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)
[Title slide: Project Brainwave — a Scalable DNN Hardware Microservice. A Neural Processing Unit (Scalar Processor + M*V Processor) serves pretrained DNN models (CNNs and RNNs) behind L0/L1 network switches.]
SLIDE 2
SLIDE 3
SLIDE 4
SLIDE 5
[Diagram: Project Brainwave — Scalable DNN Hardware Microservice. A pretrained DNN model is deployed onto a Neural Processing Unit (Scalar Processor + M*V Processor), with NPU nodes (N) connected through L0 and L1 network switches.]
SLIDE 6
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 7
Serial dependence
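The serial dependence is the core RNN serving problem: each hidden state is a function of the previous one, so time steps cannot run in parallel. A minimal sketch (hypothetical NumPy implementation, toy dimensions):

```python
import numpy as np

def rnn_forward(x_seq, W, U, b, h0):
    """Vanilla RNN: h_t depends on h_{t-1}, so the time loop is
    inherently serial -- step t cannot start before step t-1 finishes."""
    h = h0
    outputs = []
    for x_t in x_seq:  # serial dependence: iteration order matters
        h = np.tanh(W @ x_t + U @ h + b)
        outputs.append(h)
    return outputs

# toy dimensions (illustrative): 4 time steps, input dim 3, hidden dim 2
rng = np.random.default_rng(0)
x_seq = [rng.standard_normal(3) for _ in range(4)]
W = rng.standard_normal((2, 3))
U = rng.standard_normal((2, 2))
outs = rnn_forward(x_seq, W, U, np.zeros(2), np.zeros(2))
```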
SLIDE 8
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 9
[Diagram: Batched RNNs — O(N²) weight matrix]
SLIDE 10
[Diagram: Batched RNNs — O(N²) weights vs. O(N) activations per request]
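A hedged sketch of why servers normally batch RNN requests: for an N×N layer, the O(N²) weight matrix must be read once per pass regardless of batch size, while each extra request adds compute but no extra weight traffic, so arithmetic intensity grows linearly with batch (function name is illustrative):

```python
def arithmetic_intensity(n, batch):
    """Ops per weight read for an n x n layer at a given batch size.
    Batching amortizes the O(n^2) weight fetch across requests --
    the cost Brainwave aims to avoid paying at batch=1."""
    ops = 2 * n * n * batch  # one multiply + one add per weight, per request
    weights_read = n * n     # the N^2 weight matrix is fetched once per batch
    return ops / weights_read

# intensity is 2*batch, independent of the layer size n
assert arithmetic_intensity(1000, 8) == 8 * arithmetic_intensity(1000, 1)
```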
SLIDE 11
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 12
SLIDE 13
SLIDE 14
void LSTM(int steps) {
  for (int t = 0; t < steps; t++) {
    v_rd(NetQ);                // read input vector xt from the network queue
    v_wr(InitialVrf, xt);
    v_rd(InitialVrf, xt);
    mv_mul(Wf);                // xt * Wf
    vv_add(bf);                // + bf
    v_wr(AddSubVrf, xWf);
    v_rd(InitialVrf, h_prev);
    mv_mul(Uf);                // h_prev * Uf
    vv_add(xWf);               // + (xt*Wf + bf)
    v_sigm();                  // forget gate ft
    vv_mul(c_prev);            // ft * c_prev (elementwise)
    v_wr(AddSubVrf, ft_mod);
    v_rd(InitialVrf, h_prev);
    mv_mul(Uc);                // h_prev * Uc
    vv_add(xWc);               // + (xt*Wc + bc)
    v_tanh();                  // candidate cell state
    vv_mul(it);                // * input gate it (elementwise)
    vv_add(ft_mod);            // ct = ft*c_prev + it*candidate
    v_wr(MultiplyVrf, c_prev);
    v_wr(InitialVrf, ct);
    v_rd(InitialVrf, ct);
    v_tanh();                  // tanh(ct)
    vv_mul(ot);                // * output gate ot (elementwise)
    v_wr(InitialVrf, h_prev);  // ht becomes h_prev for step t+1
    v_wr(NetQ);                // emit ht to the network queue
  }
}
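As a reference for what that instruction chain computes, here is a hedged NumPy sketch of one LSTM step. The names (`lstm_step`, the parameter dict) are illustrative; the gate equations are the standard LSTM formulation that the mv_mul/vv_add/v_sigm/v_tanh sequence implements:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """Reference math for one LSTM step: each gate is an mv_mul + vv_add
    chain followed by v_sigm or v_tanh on the NPU."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])        # input gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])        # output gate
    c_t = f * c_prev + i * c_tilde  # the ft_mod + it*tanh(...) accumulation
    h_t = o * np.tanh(c_t)          # vv_mul(ot) after v_tanh(ct)
    return h_t, c_t

# toy check (illustrative sizes): all-zero weights give gates of 0.5
p = {k: np.zeros((2, 2)) for k in ("Wf", "Uf", "Wi", "Ui", "Wc", "Uc", "Wo", "Uo")}
p.update({k: np.zeros(2) for k in ("bf", "bi", "bc", "bo")})
h_t, c_t = lstm_step(np.zeros(2), np.zeros(2), np.ones(2), p)
```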
SLIDE 15
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 16
SLIDE 17
SLIDE 18
[Diagram: a single multiply (×) feeding an add (+)]
SLIDE 19
[Diagram: multipliers (×) feeding a binary adder tree (+) — a dot-product unit]
SLIDE 20
[Diagram: dot-product units (multiplier rows plus adder trees) tiled into a larger matrix-vector array]
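The grids of × and + symbols depict a row of multipliers feeding a log-depth binary adder tree, i.e. a dot-product unit. A small sketch of that reduction (function name is illustrative):

```python
def dot_product_tree(a, b):
    """Dot product the way the hardware diagram suggests: one multiplier
    per element pair, then a binary adder tree reducing partial sums,
    one tree level per loop iteration."""
    level = [x * y for x, y in zip(a, b)]  # the row of multipliers
    while len(level) > 1:
        if len(level) % 2:                 # pad odd-width levels with 0
            level = level + [0]
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# 8-wide unit: log2(8) = 3 adder-tree levels
assert dot_product_tree([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8) == 36
```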
SLIDE 21
SLIDE 22
SLIDE 23
SLIDE 24
SLIDE 25
[Diagram: Dispatcher fanning out to multiple units (D)]
SLIDE 26
[Diagram: Top Level Scheduler issuing instructions to the MVM Scheduler and the Scalar Processor]
SLIDE 27
Extract parallelism from a single thread of execution
Achieve high utilization without batching
Scale to O(100k) spatial units
Synthesis specialization
SLIDE 28
Device                     Node        Latency   Effective TFLOPS   Utilization
BW-NPU (Stratix 10, FP8)   Intel 14nm  2 ms      35.9               74.8%
BW-NPU (Arria 10, FP11)    TSMC 20nm   1.64 ms   4.7                66%
Benchmarks:
DeepBench RNN: GRU-2816, batch=1, 71B OPs/serve
CNN: ResNet-50, batch=1, 7.7B OPs/serve
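The Effective TFLOPS figures are consistent with OPs/serve divided by latency. A quick check with the slide's numbers (the rounded 2 ms latency yields ~35.5 against the reported 35.9; the helper name is illustrative):

```python
def effective_tflops(ops_per_serve, latency_s):
    """Effective throughput implied by serving one request in latency_s."""
    return ops_per_serve / latency_s / 1e12

# GRU-2816, batch=1: 71B OPs served in ~2 ms (Stratix 10 row)
gru = effective_tflops(71e9, 2e-3)        # ~35.5 TFLOPS
# ResNet-50, batch=1: 7.7B OPs served in 1.64 ms (Arria 10 row)
resnet = effective_tflops(7.7e9, 1.64e-3)  # ~4.7 TFLOPS
```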
SLIDE 29
SLIDE 30