SLIDE 1

The 45th International Symposium on Computer Architecture - ISCA 2018

RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM

Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei Institute of Microelectronics Tsinghua University

SLIDE 2

Ubiquitous Deep Neural Networks (DNNs)

Image Classification Object Detection Video Surveillance Speech Recognition

SLIDE 3

DNN Requires Large On-Chip Buffer

Layer data storage in modern DNNs can reach 0.3~6.27MB.

These numbers increase if the network processes higher-resolution images or larger batch sizes.

[1] Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS’12.
[2] Simonyan et al., “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR’15.
[3] Szegedy et al., “Going Deeper with Convolutions”, CVPR’15.
[4] He et al., “Deep Residual Learning for Image Recognition”, CVPR’16.

SLIDE 4

SRAM-based DNN Accelerators

The small footprint limits the on-chip buffer size of conventional SRAM-based DNN accelerators.

– Usually <500KB with area cost of 3~20mm². (Normalized)

[Figure: accelerator block diagrams. Recoverable labels: heterogeneous PE array (PEs and super PEs, CONV and FC/LSTM), data buffer system (Data Buffer1 and Data Buffer2, each with Bank[0] to Bank[47] and a buffer controller), weight buffer, controller, configurable interface, configuration context.]

Thinker: 348KB, 19.4mm²
DianNao: 44KB, 3.0mm²
Eyeriss: 182KB, 12.3mm²
Envision: 77KB, 10.1mm²
(Normalized)

Thinker: Yin et al., “A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications”, JSSC’18.
DianNao: Chen et al., “DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning”, ASPLOS’14.
Eyeriss: Chen et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks”, ISSCC’16.
Envision: Moons et al., “ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI”, ISSCC’17.

SLIDE 5

SRAM vs. eDRAM (Embedded DRAM)

eDRAM has higher density than SRAM. Refresh is required for data retention.

Charge will leak over time and might cause retention failures.

SLIDE 6

Refresh is an Energy Bottleneck

[1] Chang et al., “Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM”, HPCA’13.
[2] Wilkerson et al., “Reducing Cache Power with Low-Cost, Multi-bit Error-Correcting Codes”, ISCA’10.

[1] HPCA’13 eDRAM Power Breakdown [2] ISCA’10 System Power Breakdown

Overhead: eDRAM Refresh Energy

SLIDE 7

Opportunity to Remove eDRAM Refresh

Refresh Interval = Retention Time

Ghosh, “Modeling of Retention Time for High-Speed Embedded Dynamic Random Access Memories”, TCASI’14.

SLIDE 8

Opportunity to Remove eDRAM Refresh

Refresh is unnecessary, if Data Lifetime < Retention Time

Opportunity 1: Increase retention time by training.
Opportunity 2: Reduce data lifetime by scheduling.
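The condition above can be written as a small check. This is a minimal sketch, not the paper's implementation: the function names are hypothetical, times are in microseconds, and refresh is assumed to be strictly periodic.

```python
def needs_refresh(data_lifetime_us: float, retention_time_us: float) -> bool:
    """Return True when the data outlives the cell's retention time."""
    # Refresh can be skipped only if the data is consumed (or overwritten)
    # strictly before the retention time elapses.
    return data_lifetime_us >= retention_time_us


def refreshes_required(data_lifetime_us: float, refresh_interval_us: float) -> int:
    """Count the periodic refreshes issued while the data is alive."""
    if data_lifetime_us < refresh_interval_us:
        return 0
    return int(data_lifetime_us // refresh_interval_us)
```

Both opportunities act on this inequality: training raises the right-hand side, scheduling lowers the left-hand side.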

SLIDE 9

RANA: Retention-Aware Neural Acceleration Framework

[Figure: RANA framework overview. Inputs: (1) DNN accelerator, (2) target DNN model.
Stage 1 (Training), Retention-Aware Training Method: takes the accuracy constraint and the eDRAM retention time distribution; outputs the tolerable retention time.
Stage 2 (Scheduling), Hybrid Computation Pattern: energy modeling, data lifetime analysis, buffer storage analysis; outputs layerwise configurations.
Stage 3 (Architecture), Refresh-Optimized eDRAM Controller: data mapping and memory controller modification; outputs optimized energy consumption.
The diagram also marks a compilation phase and an execution phase.]

Strengthen DNN accelerators with refresh-optimized eDRAM:

– Increase on-chip buffer size by replacing SRAM with eDRAM.
– Reduce energy overhead by removing unnecessary eDRAM refresh.

SLIDE 10

RANA: Retention-Aware Neural Acceleration Framework

[Figure: RANA framework overview with the three stages annotated. Training: Retention Time↑. Scheduling: Data Lifetime↓ (flowchart producing a computation pattern <OD/WD, Tm, Tn, Tr, Tc> for each layer). Architecture: Refresh Control (unified buffer system of eDRAM banks; eDRAM controller with reference clock, programmable clock divider, refresh issuers, and eDRAM refresh flags).]

SLIDE 11

Tech1: Retention-Aware Training Method

Retention time varies from cell to cell.

– Retention failure rate: Fraction of the cells whose retention time is below the given retention time.

Kong et al., “Analysis of Retention Time Distribution of Embedded DRAM – A New Method to Characterize Across-Chip Threshold Voltage Variation”, ITC’08.

Typical eDRAM Retention Time Distribution (32KB)

The weakest cell appears at the 45 µs point.
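The failure-rate definition above can be sketched as follows. The sample values in SAMPLE_RETENTION_US are made up for illustration (only the 45 µs weakest cell comes from the slide), and both function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical per-cell retention times in microseconds (illustration only).
SAMPLE_RETENTION_US = [45, 60, 80, 100, 120, 150, 200, 400, 734, 1030]


def retention_failure_rate(cell_retention_us, target_us):
    """Fraction of cells whose retention time is below target_us."""
    ordered = sorted(cell_retention_us)
    failing = bisect_left(ordered, target_us)  # cells with retention < target
    return failing / len(ordered)


def tolerable_retention_time(cell_retention_us, max_failure_rate):
    """Longest measured retention time whose failure rate stays in budget."""
    ordered = sorted(cell_retention_us)
    allowed_failures = int(max_failure_rate * len(ordered))
    # The weakest `allowed_failures` cells may fail; the next cell sets the limit.
    return ordered[min(allowed_failures, len(ordered) - 1)]
```

With a zero failure budget the tolerable retention time collapses to the weakest cell, which is the conventional worst-case refresh interval.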

SLIDE 12

Tech1: Retention-Aware Training Method

Retrain the network to tolerate a higher failure rate and obtain a longer tolerable retention time.

[Figure: Retention-Aware Training Method. Target DNN model → fixed-point pretrain → fixed-point DNN model → adding layer masks (random bit-level errors at failure rate r) → retrain (weight adjustment) → retention-aware DNN model.]
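The "random bit-level errors" step might look like the sketch below. It assumes a retention failure shows up as a flipped bit in an unsigned fixed-point word and that every bit fails independently with probability r; the slides do not specify either detail, and inject_bit_errors is a hypothetical name.

```python
import random


def inject_bit_errors(words, bit_width, failure_rate, rng):
    """Flip each stored bit independently with probability failure_rate."""
    corrupted = []
    for word in words:
        mask = 0
        for bit in range(bit_width):
            if rng.random() < failure_rate:
                mask |= 1 << bit  # mark this bit as a retention failure
        corrupted.append(word ^ mask)  # assumption: a failure flips the bit
    return corrupted


# Usage: corrupt 16-bit values at r = 1e-5 before a retraining pass.
rng = random.Random(0)
noisy = inject_bit_errors([0x1234, 0x00FF], 16, 1e-5, rng)
```

Drawing a fresh mask in each retraining iteration lets the weights adjust to errors at the chosen rate rather than to one fixed error pattern.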

SLIDE 13

Tech1: Retention-Aware Training Method

Failure rate of 10^-5: No accuracy loss, 734 µs.
Failure rate of 10^-4: Accuracy decreases.
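As a rough worked example, moving the refresh interval from the 45 µs weakest-cell point to the 734 µs tolerable retention time cuts periodic refresh operations as computed below. This isolates the interval change under always-on periodic refresh; it is not the deck's reported result, which also reflects the scheduling and controller techniques. The function name is hypothetical.

```python
def refresh_reduction(baseline_interval_us, relaxed_interval_us):
    """Fraction of periodic refresh operations removed by a longer interval."""
    # Refresh count is inversely proportional to the refresh interval.
    return 1.0 - baseline_interval_us / relaxed_interval_us


reduction = refresh_reduction(45.0, 734.0)
print(f"{reduction:.1%}")  # prints 93.9%
```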

[Figure: Relative Accuracy under Different Retention Failure Rates. Marked retention times: 45 µs, 734 µs, 1030 µs.]

SLIDE 14

Tech2: Hybrid Computation Pattern

A computation pattern is expressed as a loop nest.
Data lifetime and buffer storage depend on the loop ordering, especially the outermost-level loop.
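A minimal sketch of such a loop nest for one convolutional layer (stride 1, no padding) is shown here. Only the channel loops are tiled, with Tm and Tn as tile sizes; the `outer` switch and its labels are illustrative and are not the deck's pattern definitions.

```python
def conv_layer(inputs, weights, Tm, Tn, outer="output"):
    """Tiled convolution; `outer` picks which channel-tile loop is outermost."""
    N, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    M, K = len(weights), len(weights[0][0])
    R, C = H - K + 1, W - K + 1
    outputs = [[[0] * C for _ in range(R)] for _ in range(M)]
    m_tiles, n_tiles = range(0, M, Tm), range(0, N, Tn)
    if outer == "output":
        # Each output tile finishes accumulating before the next one starts.
        tile_order = [(m0, n0) for m0 in m_tiles for n0 in n_tiles]
    else:
        # Every output tile keeps partial sums alive until the last input tile.
        tile_order = [(m0, n0) for n0 in n_tiles for m0 in m_tiles]
    for m0, n0 in tile_order:
        for m in range(m0, min(m0 + Tm, M)):
            for n in range(n0, min(n0 + Tn, N)):
                for r in range(R):
                    for c in range(C):
                        for i in range(K):
                            for j in range(K):
                                outputs[m][r][c] += (
                                    weights[m][n][i][j] * inputs[n][r + i][c + j]
                                )
    return outputs
```

Both orderings compute the same result; what changes is how long inputs, weights, and partial sums must stay in the buffer, which is the quantity the scheduling step cares about.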

SLIDE 15

Tech2: Hybrid Computation Pattern

Outputs are dynamically updated by accumulation, which recharges the cells like periodic refresh.

Different computation patterns have different data lifetime and buffer storage requirements.
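One way to picture the recharging effect: if every write restores the cell's charge, the interval that matters is the longest gap between consecutive writes, or between the last write and the last read. The sketch below assumes exactly that model; the timestamps in the usage lines are hypothetical and the function name is not from the slides.

```python
def max_unrefreshed_interval(write_times, last_read_time):
    """Longest time a cell must hold data without being rewritten."""
    events = sorted(write_times)
    # Gaps between consecutive writes (each write recharges the cell).
    gaps = [later - earlier for earlier, later in zip(events, events[1:])]
    # Gap between the final write and the final read of the data.
    gaps.append(last_read_time - events[-1])
    return max(gaps)


# An output accumulated every 10 time units versus an input written once.
output_gap = max_unrefreshed_interval([0, 10, 20, 30], 35)  # 10
input_gap = max_unrefreshed_interval([0], 35)               # 35
```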

[Figure panels: Input Dependent, Output Dependent, Weight Dependent.]

SLIDE 16

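[Figure: scheduling flowchart. Inputs: DNN accelerator (hardware constraints) and DNN model (layer description). For each layer, run the scheduling scheme to obtain the computation pattern <OD/WD, Tm, Tn, Tr, Tc>. If it is not the last layer, switch to the next layer; otherwise output the configurations for each layer.]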

Tech2: Hybrid Computation Pattern

Scheduling scheme:

– Input: DNN accelerator and network’s parameters.
– Optimization: Minimize total system energy.
– Output: Layerwise configurations.

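\[
\begin{aligned}
\min\ & \mathit{Energy} \\
\text{s.t.}\ & \mathit{Energy} = \text{Equation (14)}, \\
& T_n \cdot T_h \cdot T_l \le R_i, \\
& T_m \cdot T_r \cdot T_c \le R_o, \\
& T_m \cdot T_n \cdot K^2 \le R_w, \\
& 1 \le T_m \le M,\ 1 \le T_n \le N,\ 1 \le T_r \le R,\ 1 \le T_c \le C.
\end{aligned}
\]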

Scheduling Scheme
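The per-layer search could be sketched as a brute-force enumeration of tiling parameters under the three buffer limits. Several things here are assumptions: the energy model (Equation 14 in the paper) is not reproduced on the slide, so toy_energy is a placeholder; the input tile size uses the usual sliding-window relation (Tr − 1)·stride + K, which the slide does not state; all names are hypothetical.

```python
from itertools import product


def toy_energy(pattern, Tm, Tn, Tr, Tc):
    """Placeholder cost: favors large tiles, slightly penalizes WD."""
    return -(Tm * Tn * Tr * Tc) + (0 if pattern == "OD" else 1)


def schedule_layer(M, N, R, C, K, input_cap, output_cap, weight_cap,
                   energy_fn=toy_energy, stride=1):
    """Return (energy, (pattern, Tm, Tn, Tr, Tc)) with minimum energy, or None."""
    best = None
    tilings = product(range(1, M + 1), range(1, N + 1),
                      range(1, R + 1), range(1, C + 1))
    for Tm, Tn, Tr, Tc in tilings:
        Th = (Tr - 1) * stride + K  # assumed input tile height
        Tl = (Tc - 1) * stride + K  # assumed input tile width
        if Tn * Th * Tl > input_cap:
            continue  # input tile does not fit
        if Tm * Tr * Tc > output_cap:
            continue  # output tile does not fit
        if Tm * Tn * K * K > weight_cap:
            continue  # weight tile does not fit
        for pattern in ("OD", "WD"):
            energy = energy_fn(pattern, Tm, Tn, Tr, Tc)
            if best is None or energy < best[0]:
                best = (energy, (pattern, Tm, Tn, Tr, Tc))
    return best
```

Running this once per layer and collecting the winners gives the layerwise configurations; exhaustive search is used only to keep the sketch short.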

SLIDE 17

Tech3: Refresh-Optimized eDRAM Controller

eDRAM controller:

– Programmable clock divider: Refresh interval.
– Refresh issuers and flags, for each eDRAM bank.
– Configuration from Tech1 & Tech2.
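A behavioral sketch of this controller, in software rather than hardware: a clock divider stretches the reference period into the refresh interval, and a flag per bank decides whether that bank is refreshed at all. The class, its fields, and the numbers in the usage lines are hypothetical.

```python
class RefreshController:
    """Toy model: a clock divider sets the interval, flags gate each bank."""

    def __init__(self, num_banks, reference_period_us):
        self.reference_period_us = reference_period_us
        self.divider = 1
        self.refresh_flags = [False] * num_banks
        self.refresh_counts = [0] * num_banks
        self._ticks = 0

    def configure(self, divider, flags):
        """Load a per-layer configuration (divider value and bank flags)."""
        self.divider = divider
        self.refresh_flags = list(flags)
        self._ticks = 0

    @property
    def refresh_interval_us(self):
        return self.divider * self.reference_period_us

    def tick(self):
        """Advance one reference-clock cycle; return the banks refreshed."""
        self._ticks += 1
        if self._ticks < self.divider:
            return []
        self._ticks = 0
        refreshed = [b for b, flag in enumerate(self.refresh_flags) if flag]
        for bank in refreshed:
            self.refresh_counts[bank] += 1
        return refreshed


# Usage: 1 µs reference clock, 45 µs interval, refresh banks 0 and 2 only.
ctrl = RefreshController(num_banks=4, reference_period_us=1.0)
ctrl.configure(divider=45, flags=[True, False, True, False])
for _ in range(90):
    ctrl.tick()
```

Banks whose flag is cleared never receive a refresh, which models skipping refresh for banks holding short-lived data.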

[Figure: unified buffer system of eDRAM banks driven by the eDRAM controller (reference clock, programmable clock divider, refresh issuers, eDRAM refresh flags).]

SLIDE 18

Evaluation Platform

RTL-level cycle-accurate simulation, for performance estimation and memory access tracing.

System-level energy estimation, based on synthesis, Destiny and CACTI.

Platform Configurations
DNN Accelerator: 256 MACs, 384KB SRAM, 200MHz, 5.682mm², 65nm
eDRAM: 1.454MB, retention time = 45 µs, 65nm

Kong et al., “Analysis of Retention Time Distribution of Embedded DRAM – A New Method to Characterize Across-Chip Threshold Voltage Variation”, ITC’08.

SLIDE 19

Experimental Results

eDRAM refresh operations: 99.7%↓
Off-chip memory access: 41.7%↓
System energy consumption: 66.2%↓

SLIDE 20

Scalability to Other Architectures

DaDianNao: 4096 MACs, 36MB eDRAM, 606MHz.

eDRAM refresh operations: 99.9%↓
System energy consumption: 69.4%↓

Chen et al., “DaDianNao: A Machine-Learning Supercomputer”, MICRO’14.

SLIDE 21

Takeaway

RANA: Retention-Aware Neural Acceleration Framework

Training: Retention-aware training method.

– Exploit DNN’s error resilience to improve tolerable retention time.

Scheduling: Hybrid computation pattern.

– Different computing order and parallelism show different data lifetime and buffer storage requirement.

Architecture: Refresh-Optimized eDRAM controller.

– No need to refresh all the banks.
– No need to always use the worst-case refresh interval.

Not limited to applying eDRAM to DNN acceleration.

– Approximate computing: Retention and error resilience.

[Figure: RANA framework overview. (1) Training: Retention-Aware Training Method, giving the tolerable retention time. (2) Scheduling: Hybrid Computation Pattern, giving layerwise configurations. (3) Architecture: Refresh-Optimized eDRAM Controller, giving optimized energy consumption. Inputs: DNN accelerator and target DNN model.]

SLIDE 22

Thank you for your attention!

Email: tfb13@mails.tsinghua.edu.cn