A 19.4 nJ/Decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array




  1. A 19.4 nJ/Decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array. Mingu Kang, Sujan Gonugondla, Naresh Shanbhag, University of Illinois at Urbana-Champaign

  2. Machine Learning under Resource Constraints
   Embedded statistical inference: IoT, sensor-rich platforms
   Decision making under resource constraints
   Limited form factor, battery-powered, real-time

  3. The Random Forest (RF) Algorithm
   Random Forest [1]
     Ensemble of many (a few hundred) decision trees
     High accuracy
     Simple computation (only comparisons)
     Suitable for multi-class classification
     Inherent error-resiliency (from its ensemble nature)
  [1] L. Breiman, Machine Learning, 2001
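The "only comparisons" point above can be made concrete with a minimal inference sketch (hypothetical data structures, not the chip's implementation): each internal tree node holds a feature index and a threshold, so classifying an input takes one comparison per node plus a majority vote over the ensemble.

```python
# Minimal random-forest inference sketch. A tree node is either a leaf
# ("class", label) or ("node", feature_idx, threshold, left, right).
from collections import Counter

def classify_tree(node, x):
    while node[0] == "node":
        _, idx, thresh, left, right = node
        node = left if x[idx] < thresh else right  # one comparison per node
    return node[1]

def classify_forest(trees, x):
    votes = Counter(classify_tree(t, x) for t in trees)
    return votes.most_common(1)[0][0]  # majority vote over the ensemble

# Tiny example: two stump trees over a 2-feature input.
t1 = ("node", 0, 0.5, ("class", "A"), ("class", "B"))
t2 = ("node", 1, 0.3, ("class", "A"), ("class", "B"))
print(classify_forest([t1, t2, t1], [0.7, 0.1]))  # prints "B" (2 votes vs. 1)
```

Because a wrong branch in one tree is outvoted by the rest, errors in individual comparisons tend to wash out, which is the error-resiliency the ensemble provides.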

  4. Implementation Challenges
   Implementation challenges
     Non-uniform tree structure: variations in depth, number of nodes, and symmetry
     Frequent memory access (node thresholds and feature indices): memory dominates the system efficiency
     Irregular data access pattern (feature indices)
   Prior art:
     Software and FPGA implementations; no ASIC
     Fails to take advantage of the RF algorithm's inherent error-resiliency

  5. Proposed Solution: Deep In-Memory Architecture (DIMA) with DSS
   DIMA [2-4]:
     Embedded analog processing
     Storage density and normal read & write functions preserved
     FR: functional read
     BLP: bitline processor (subtraction, comparison)
     CBLP: cross BLP (aggregation)
     RDL: ADC & residual digital logic
   Deterministic sub-sampling (DSS)
     Regularizes the memory access pattern
  [2] M. Kang et al., ICASSP 2014; [3] M. Kang et al., arXiv 2016; [4] M. Kang et al., US Patent No. 9,697,877

  6. RF Chip Architecture
   SRAM bitcell array
     Stores up to 42 groups; each group has 4 sub-groups (1 sub-group = 1 tree)
   Input buffer
     Stores 4:1 sub-sampled pixels in 4 sections for DSS
   Cross bar (CB)
     31 CB units per sub-group, enabled in parallel
   Comparator (COMP)
     128 analog comparators (ΔV_BL vs. ΔV_ref)
  (IREG: pixel index register; RSREG: RSS register)

  7. Functional Read (FR)
   Conventional read fetches one bit per bitline precharge; functional read (FR) fetches and computes a linear combination of the stored data directly in the analog domain, with a binary-weighted bitline discharge ΔV_BL ∝ Σ_i 0.5^i d_i over the bits d_i of a stored word
   (L·B) times more data access per read & precharge (B: bit precision, L: column mux ratio)
   Savings in energy & delay at the cost of reduced SNR
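A behavioral model makes the functional read concrete. This is a sketch under assumptions, not the circuit: it takes the slide's binary-weighted relation literally and treats `dv_lsb` (the swing contributed by the LSB) as a free parameter, so one precharge yields an analog value proportional to the whole stored word.

```python
# Behavioral sketch of DIMA functional read: a B-bit word, read with
# binary-weighted wordline pulse widths, discharges the bitline by an
# amount proportional to the word's value. dv_lsb is a hypothetical
# per-LSB swing, not a measured device parameter.
def functional_read_dv(bits, dv_lsb=0.01):
    """Return ΔV_BL for a stored word; bits are given MSB first."""
    B = len(bits)
    return sum(d * dv_lsb * 2 ** (B - 1 - i) for i, d in enumerate(bits))

# The 4-bit word 1011 (value 11) discharges 11 * dv_lsb of bitline swing.
print(functional_read_dv([1, 0, 1, 1]))  # ~0.11 V
```

The SNR cost in the slide follows directly from this model: packing B bits into one analog swing shrinks the voltage margin per bit by 2^(B-1) relative to a conventional single-bit read.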

  8. In-Memory Bitline Processing
   Subtraction
     Store the threshold T and the complemented input X in the same column; the combined bitline discharge then tracks T − X, i.e., ΔV_BL ∝ T − X
   Comparison
     ΔV_BL > ΔV_ref → 1; ΔV_BL < ΔV_ref → 0
  [Figure: measured subtraction in a column of the 6T SRAM array (65 nm CMOS): V_BL (V) vs. T_MSB − X_MSB over −15…15; the spread at each T_MSB − X_MSB value is variation due to the possible (T_MSB, X_MSB) combinations]
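The subtract-and-compare step can be sketched behaviorally (an assumed model, not transistor-level): storing T and the one's complement of X in one column makes the total discharge proportional to T + (2^B − 1 − X), so subtracting the discharge expected when T = X recovers the sign of T − X, which the analog comparator then turns into the branch bit.

```python
# Behavioral sketch of bitline subtraction and comparison for one tree node.
# dv_lsb and v_ref are hypothetical parameters of the model, not chip values.
def node_decision(t, x, bits=8, dv_lsb=0.001, v_ref=0.0):
    x_comp = (1 << bits) - 1 - x            # one's complement of input X
    dv = dv_lsb * (t + x_comp)              # both words discharge one bitline
    dv_center = dv_lsb * ((1 << bits) - 1)  # discharge when T == X
    return 1 if (dv - dv_center) > v_ref else 0  # sign of T - X

print(node_decision(t=100, x=40))   # T > X -> 1
print(node_decision(t=40, x=100))   # T < X -> 0
```

A real comparator sees noise and offset around v_ref; the ensemble vote over many trees is what absorbs the resulting occasional wrong branch bits.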

  9. Deterministic Sub-Sampling (DSS)
   Random sub-sampling (RSS)
     Requires a complex cross bar (e.g., 256:1 for a 256-pixel input X)
   Deterministic sub-sampling (DSS) before RSS
     Sub-samples X to generate four sub-images X_{1,2,3,4}
     Reduces cross-bar complexity (e.g., 256:1 → 64:1)
     More than 3× energy and 4× layout-area savings
     4:1 ratio chosen from the accuracy vs. sub-sampling-ratio trade-off
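One natural way to realize a 4:1 DSS is a 2×2 polyphase split of the image by pixel parity; the exact pixel-to-sub-image mapping used on the chip is not given here, so this layout is an assumption. The point it illustrates stands regardless: each tree's feature indices then only span one quarter of the pixels, so the crossbar shrinks (e.g., 256:1 → 64:1).

```python
# Sketch of 4:1 deterministic sub-sampling: split a 2D image into four
# sub-images by row/column parity (assumed mapping, for illustration).
def dss_split(image):
    """Split a 2D image (list of rows) into 4 polyphase sub-images."""
    return [
        [row[dx::2] for row in image[dy::2]]  # rows dy, dy+2, ...; cols dx, dx+2, ...
        for dy in (0, 1) for dx in (0, 1)
    ]

img = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 test image
subs = dss_split(img)
print(subs[0])  # even rows, even cols: [[0, 2], [8, 10]]
```

Because neighboring pixels are highly correlated in natural images, each sub-image retains most of the information of the full image, which is why accuracy degrades only mildly at the 4:1 ratio.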

  10. Application & Measured Results
   Training (off-chip)
     200 images per class employed for training
     Bit precision: 8; tree depth: 6; 64 trees
   Testing
     200 randomly chosen test images from the KUL Belgium traffic sign dataset

  Platform (65 nm CMOS) | # of trees | Tree depth | Classification rate (decisions/ms) | Energy per decision (nJ) | Energy-delay product (fJ·s) | Accuracy (%)
  Conv. arch.           | 64         | 6          | 167/bank                           | 60.4                     | 361.6                       | 93.5
  Proposed arch.        | 64         | 6          | 364/bank                           | 19.4                     | 53.2                        | 94
   EDP reduced by 6.8×
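The table's EDP column is just energy per decision times delay per decision, and the headline 6.8× follows from the other two columns. A quick check, using the numbers copied from the table:

```python
# Sanity-check of the table's EDP figures.
# EDP = (energy per decision) x (delay per decision), expressed in fJ*s.
def edp_fjs(energy_nj, rate_decisions_per_s):
    delay_s = 1.0 / rate_decisions_per_s
    return energy_nj * 1e-9 * delay_s / 1e-15  # J*s converted to fJ*s

conv = edp_fjs(60.4, 167e3)   # conventional: 167 decisions/ms at 60.4 nJ
prop = edp_fjs(19.4, 364e3)   # proposed: 364 decisions/ms at 19.4 nJ
print(round(conv, 1), round(prop, 1), round(conv / prop, 1))  # 361.7 53.3 6.8
```

The recomputed values (361.7 and 53.3 fJ·s) match the table's 361.6 and 53.2 to rounding, confirming the 6.8× EDP reduction.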

  11. Measured Energy vs. Accuracy Trade-off
   Accuracy vs. # of trees vs. BL swing (ΔV_BL): accuracy drops as the BL swing, and hence the energy, is reduced
   More trees → more error resiliency → allows a lower BL swing → higher energy efficiency (accuracy vs. energy)
  (* ΔV_BL for the conventional case is 10× the "ΔV_BL per LSB")

  12. Chip Summary & Comparison
  Chip summary:
    Technology: 65 nm CMOS
    Die size: 1.2 × 1.2 mm
    SRAM capacity: 16 KB (512 × 256 bit-cells)
    Bit-cell size: 2.11 × 0.92 µm²
    CLK freq.: 1 GHz (CTRL)
    Supply voltage: 1.0 V (CORE), 0.75 V (CTRL)

  Comparison with state-of-the-art (* scaled to 65 nm CMOS):
  Prior art     | Process        | Algorithm              | Dataset            | Accuracy     | Input size (8b) | Throughput (decisions/s) | Energy (nJ/decision) | EDP (fJ·s/decision)
  [5]           | 130 nm CMOS    | Support vector machine | Traffic sign video | 90%          | 320 × 240       | 33 [40K]*                | 1.5M [1250]*         | 45G [31250]*
  [6]           | 14 nm tri-gate | K-nearest neighbor     | Not reported       | Not reported | 128             | 21.5M [498.8K]*          | 3.4 [145.3]*         | 0.2 [292.3]*
  Ours (M = 64) | 65 nm CMOS     | Random forest          | KUL traffic signs  | 94%          | 16 × 16         | 364.4K                   | 19.4                 | 52.4 (w/ CTRL)
  [5] J. Park et al., JSSC 2012; [6] H. Kaul et al., ISSCC 2016

  13. Conclusions
   First ASIC implementation of the RF algorithm
     Low-SNR processing via DIMA and DSS
   Energy & speed benefits
     2.2× smaller delay and 3.1× smaller energy → 6.8× smaller EDP compared to a digital ASIC
   Higher potential in large-scale applications
     # of trees grows to a few hundred in real-life applications → higher error-resiliency → more room to scale ΔV_BL for energy efficiency
   Future work
     On-chip training to compensate for process variations
     Different algorithms (e.g., boosted ensemble classifiers)

  14. Acknowledgment
   This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by SRC and DARPA.
