Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping



slide-1
SLIDE 1

Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping

Takuya Kojima, Naoki Ando, Yusuke Matsushita, Hayate Okuhara, Nguyen Anh Vu Doan and Hideharu Amano Keio University, Japan

International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2018), Toronto, Canada

slide-2
SLIDE 2

Outline

  • Introduction
  • A CGRA Architecture
  • Three Types of Control
      1. Pipeline Structure Control
      2. Body Bias Control
      3. Application Mapping
  • New Mapping Optimization Method
  • Real Chip Implementation
  • Experimental Results
  • Conclusion

slide-3
SLIDE 3

Importance of Low Power Consumption

  • Forthcoming
      • IoT devices
      • Wearable computing
      • Sensor network
  • Challenges
      • High performance
          • For image processing
      • Low power consumption
          • For long battery life

slide-4
SLIDE 4

SF-CGRAs: Straight-Forward Coarse-Grained Reconfigurable Arrays

  • Key features of straight-forward CGRAs
      • Limited data flow direction
      • Less frequent reconfiguration
      • Pipelined PE array
      • High energy efficiency

[Figure: SF-CGRA structure, from top to bottom: permutation network, row of PEs, pipeline register, permutation network, row of PEs, data memory.]

slide-5
SLIDE 5

VPCMA: Variable Pipelined Cool Mega Array [1]

  • PE array consists of
      • 8 x 12 PEs
      • 7 pipeline registers
  • PE has
      • No register file
      • No clock tree
  • Pipeline register works in
      1. Latch mode
      2. Bypass mode
  • μ-controller
      • Controls data transfer: data memory ↔ PE array

[Figure: VPCMA block diagram showing the PE array, pipeline registers, data manipulator, data memory, and μ-controller.]

[1] N. Ando, et al. "Variable pipeline structure for Coarse Grained Reconfigurable Array CMA." Field-Programmable Technology, 2016.

slide-6
SLIDE 6

Pipeline Structure Control

[Figure: four PE rows (1st to 4th) with pipeline registers between them, divided into three pipeline stages (1st, 2nd, 3rd).]

[Table: large vs. small number of pipeline stages, compared in terms of operating frequency, throughput, glitch propagation, and dynamic power of registers & clock.]

slide-7
SLIDE 7

Pipeline Structure Control

[Figure: four PE rows (1st to 4th) with pipeline registers between them, divided into two pipeline stages (1st, 2nd).]

[Table: large vs. small number of pipeline stages, compared in terms of operating frequency, throughput, glitch propagation, and dynamic power of registers & clock.]
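The pipeline structure control can be sketched in a few lines of Python. This is only an illustration, not the chip's actual configuration format: it assumes one pipeline register sits between each pair of adjacent PE rows (8 rows, 7 registers), and the bit-mask encoding and the name `stages_from_mask` are hypothetical.

```python
from typing import List

NUM_ROWS = 8             # PE rows in the 8 x 12 array
NUM_REGS = NUM_ROWS - 1  # assumed: one pipeline register between adjacent rows


def stages_from_mask(mask: int) -> List[List[int]]:
    """Group PE rows (1-based) into pipeline stages.

    Hypothetical encoding: bit i of `mask` is 1 when the register between
    row i+1 and row i+2 is in latch mode, and 0 when it is in bypass mode.
    """
    stages = [[1]]
    for i in range(NUM_REGS):
        row = i + 2
        if (mask >> i) & 1:
            stages.append([row])    # latched register starts a new stage
        else:
            stages[-1].append(row)  # bypassed register merges the rows
    return stages


# Every latch/bypass combination of the 7 registers.
all_configs = [stages_from_mask(m) for m in range(2 ** NUM_REGS)]

print(len(all_configs))             # number of pipeline structures
print(stages_from_mask(0b0001000))  # only the register after row 4 latched
```

With all registers bypassed the array is a single combinational stage; with all registers latched every row is its own stage.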

slide-8
SLIDE 8

Body Bias Effects on SOTB

  • Tradeoff between leak power and performance
      • Reverse bias: decrease of static power
      • Forward bias: performance enhancement
  • SOTB technology
      • 65 nm
      • A type of FD-SOI
      • Body biasing

[Figure: body bias axis spanning reverse bias, zero bias, and forward bias.]

slide-9
SLIDE 9

Row-level Body Bias Control

[Figure: delay time of a PE for each opcode (AND, SL, MULT, ADD) on a time axis with a deadline, shown for a 2-stage and a 4-stage pipeline. It compares the delay time with no control against the delay time with row-level control and marks the probability of leak power reduction.]

slide-10
SLIDE 10

How to map an application to the PE array?

  • An application is represented as a data flow graph (DFG)
  • Various mappings exist

[Figure: example application DFG (operations +, −, ×, OR, <<, >>) mapped onto the PE array. Evaluation of this mapping: high performance, large power.]

slide-11
SLIDE 11

How to map an application to the PE array?

  • An application is represented as a data flow graph (DFG)
  • Various mappings exist

[Figure: example application DFG (operations +, −, ×, OR, <<, >>) mapped onto the PE array. Evaluation of this mapping: small power, low performance.]
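To make the idea of "various mappings" concrete, here is a minimal Python sketch. The DFG, the PE coordinates, the feed-forward check, and the Manhattan-distance wire length are all illustrative assumptions, not the representation used by the authors' tool.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, column) of a PE

# Hypothetical DFG: node name -> list of successor nodes.
dfg: Dict[str, List[str]] = {
    "add": ["mul"],
    "shl": ["mul"],
    "mul": ["or"],
    "or": [],
}


def is_feed_forward(graph: Dict[str, List[str]], mapping: Dict[str, Coord]) -> bool:
    """Check that every edge goes to a strictly later PE row
    (assumed model of the limited data flow direction)."""
    return all(mapping[src][0] < mapping[dst][0]
               for src, dsts in graph.items() for dst in dsts)


def total_wire_length(graph: Dict[str, List[str]], mapping: Dict[str, Coord]) -> int:
    """Sum of Manhattan distances over all DFG edges (assumed metric)."""
    return sum(abs(mapping[dst][0] - mapping[src][0]) +
               abs(mapping[dst][1] - mapping[src][1])
               for src, dsts in graph.items() for dst in dsts)


# Two different mappings of the same DFG.
compact = {"add": (0, 0), "shl": (0, 1), "mul": (1, 0), "or": (2, 0)}
spread = {"add": (0, 0), "shl": (0, 5), "mul": (3, 2), "or": (7, 2)}

for name, m in (("compact", compact), ("spread", spread)):
    print(name, is_feed_forward(dfg, m), total_wire_length(dfg, m))
```

Both mappings are valid, but they occupy different rows and wire lengths, which is why their power and performance differ.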

slide-12
SLIDE 12

Complexity of Mapping Optimization

  • 1. Application Mapping
      • NP-complete problem
  • 2. Pipeline Structure
      • 128 patterns
      • Affects dynamic power
  • 3. Body Bias Voltage (BBV) for Each Row
      • (# of voltages)^(# of rows) patterns
      • Affects static power
  • Tradeoff between leak power and dynamic power
  • The three controls are interdependent on each other
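The size of the search space can be estimated with a short calculation. The sketch below assumes 7 two-mode pipeline registers (which gives the 128 patterns on the slide) and one independently chosen voltage per row, so the body bias assignments number (voltages)^(rows). The values of 8 rows and 7 voltages (−0.8 V to +0.4 V in 0.2 V steps) are taken from other slides and used here only as an example.

```python
def pipeline_patterns(num_registers: int) -> int:
    """Each pipeline register is either in latch mode or in bypass mode."""
    return 2 ** num_registers


def bbv_patterns(num_rows: int, num_voltages: int) -> int:
    """One voltage is chosen independently for each PE row."""
    return num_voltages ** num_rows


pipelines = pipeline_patterns(7)   # 128
bbvs = bbv_patterns(8, 7)          # 5764801, example values (assumption)
print(pipelines, bbvs, pipelines * bbvs)
```

This count does not yet include the application mapping itself, which multiplies the space further.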

slide-13
SLIDE 13

Related work

  • 1. Performance & power optimization for CGRA [2]
      • Considering VDD control
      • Optimization priority: Performance > Power
  • 2. Body bias domain size exploration for CGRAs [3]
      • Analysis of area overhead and power reduction effects
      • Not taking care of the dynamic power
  • 3. Pipeline & body bias optimization for CGRAs [4]
      • Method using integer linear programming
      • Assuming static mapping

[2] Gu, Jiangyuan, et al. "Energy-aware loops mapping on multi-vdd CGRAs without performance degradation." Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific. IEEE, 2017.
[3] Y. Matsushita, "Body Bias Grain Size Exploration for a Coarse Grained Reconfigurable Accelerator", Proc. of the 26th International Conference on Field-Programmable Logic and Applications (FPL), 2016.
[4] T. Kojima, et al. "Optimization of body biasing for variable pipelined coarse-grained reconfigurable architectures". IEICE Transactions on Information and Systems, Vol. E101-D, No. 6, June 2018.

slide-14
SLIDE 14

Is optimizing only the power consumption enough?

  • Several requirements
      • Power consumption
      • Performance (operating frequency)
      • Throughput
  • Multi-objective optimization brings users
      • A variety of choices
      • Balancing the tradeoffs

[Figure: tradeoff among power, performance, and throughput.]
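What "a variety of choices" means can be illustrated with Pareto dominance. The following Python sketch uses made-up candidate values. Every objective is written as a quantity to minimize, so frequency and throughput are negated, and the function names are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b)) and
            any(x < y for x, y in zip(a, b)))


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative candidates: (power, -frequency, -throughput).
candidates = [
    (10.0, -30.0, -1.0),
    (8.0, -20.0, -1.0),
    (12.0, -30.0, -1.0),  # same speed as the first, but more power
    (9.0, -20.0, -0.5),   # worse than the second in power and throughput
]

for point in pareto_front(candidates):
    print(point)
```

The surviving points are the tradeoffs a user can choose from: one is faster, the other consumes less power.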

slide-15
SLIDE 15

Proposal: Use Multi-Objective Optimization

  • Non-dominated Sorting Genetic Algorithm-II (NSGA-II)
      • Multi-objective genetic algorithm
  • In this work
      • 1-point crossover
      • Commonly-used probabilities [5]
          • 0.7 crossover probability
          • 0.3 mutation probability
      • 300 generations

[5] L. Davis. "Adapting operator probabilities in genetic algorithms". In Proceedings of the Third International Conference on Genetic Algorithms, pp. 61–69, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.
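The genetic operators listed above can be sketched as follows. This covers only the variation step. NSGA-II's selection (non-dominated sorting and crowding distance) is omitted. The gene is assumed to be a flat list of integers, and the mutation probability is assumed to apply per individual. Neither detail is stated on the slide, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

CROSSOVER_PROB = 0.7  # from the slide
MUTATION_PROB = 0.3   # from the slide; applied per individual (assumption)


def one_point_crossover(a: Sequence[int], b: Sequence[int],
                        rng: random.Random) -> Tuple[List[int], List[int]]:
    """Swap the tails of two equal-length genes at one random cut point."""
    point = rng.randint(1, len(a) - 1)  # cut strictly inside the gene
    return (list(a[:point]) + list(b[point:]),
            list(b[:point]) + list(a[point:]))


def mutate(gene: Sequence[int], choices: Sequence[int],
           rng: random.Random) -> List[int]:
    """Replace one randomly chosen position with a random allowed value."""
    child = list(gene)
    child[rng.randrange(len(child))] = rng.choice(choices)
    return child


def make_offspring(a: Sequence[int], b: Sequence[int], choices: Sequence[int],
                   rng: random.Random) -> Tuple[List[int], List[int]]:
    """Produce two children from two parents using the probabilities above."""
    c1, c2 = list(a), list(b)
    if rng.random() < CROSSOVER_PROB:
        c1, c2 = one_point_crossover(c1, c2, rng)
    if rng.random() < MUTATION_PROB:
        c1 = mutate(c1, choices, rng)
    if rng.random() < MUTATION_PROB:
        c2 = mutate(c2, choices, rng)
    return c1, c2


rng = random.Random(0)
print(make_offspring([0] * 7, [1] * 7, [0, 1], rng))
```

In a full run this step would be repeated for 300 generations, with NSGA-II selection choosing the survivors after each one.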

slide-16
SLIDE 16

Gene & Evaluation of Individuals

[Figure: evaluation flow of an individual. Gene: DFG mapping and pipeline structure. Flow elements: routing, analysis of each path, degree of parallelism, total wire length, critical path delay, glitch estimation, dynamic power, target frequency, ILP solver for BBVs, BBV for each row, static power, total power.]

slide-17
SLIDE 17

Gene & Evaluation of Individuals

[Figure: evaluation flow of an individual, from DFG mapping and pipeline structure through routing, path analysis, glitch estimation, and the ILP solver for BBVs to dynamic, static, and total power.]

  • Dynamic power model
      • Proposed in [6]
      • Considering glitch propagation
      • Based on results of real chip measurements

[6] T. Kojima, et al. "Glitch-aware variable pipeline optimization for CGRAs". ReConFig2017, pp. 1–6, Dec 2017.

slide-18
SLIDE 18

Gene & Evaluation of Individuals

[Figure: evaluation flow of an individual, from DFG mapping and pipeline structure through routing, path analysis, glitch estimation, and the ILP solver for BBVs to dynamic, static, and total power.]

  • An Integer Linear Program (ILP)
      • Minimizes the static power
      • Considers timing constraints
      • Solves within 0.1 sec
      • The same method as proposed in [4]

[4] T. Kojima, et al. "Optimization of body biasing for variable pipelined coarse-grained reconfigurable architectures". IEICE Transactions on Information and Systems, Vol. E101-D, No. 6, June 2018.
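The optimization problem behind this step (minimize static power subject to timing constraints) can be illustrated on a toy instance. The sketch below is not the ILP formulation of [4]: it swaps in an exhaustive search, assumes the delay of a pipeline stage is the sum of the delays of its rows, and uses made-up delay and leak values per body bias voltage.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# Hypothetical per-row figures keyed by body bias voltage [V].
# Illustrative numbers only, not measurements from the chip.
DELAY_NS: Dict[float, float] = {-0.4: 14.0, 0.0: 10.0, 0.4: 8.0}
LEAK_UW: Dict[float, float] = {-0.4: 1.0, 0.0: 4.0, 0.4: 12.0}


def best_bbv(stages: List[List[int]],
             period_ns: float) -> Optional[Tuple[float, Dict[int, float]]]:
    """Return (total leak, {row: voltage}) with minimum leak such that
    every stage meets the clock period, or None if no assignment fits."""
    rows = [r for stage in stages for r in stage]
    best: Optional[Tuple[float, Dict[int, float]]] = None
    for volts in product(DELAY_NS, repeat=len(rows)):
        bbv = dict(zip(rows, volts))
        # Timing constraint: summed row delays per stage (assumed model).
        if any(sum(DELAY_NS[bbv[r]] for r in stage) > period_ns
               for stage in stages):
            continue
        leak = sum(LEAK_UW[v] for v in volts)
        if best is None or leak < best[0]:
            best = (leak, bbv)
    return best


# Rows 1 and 2 share a stage; row 3 is a stage of its own.
print(best_bbv([[1, 2], [3]], period_ns=25.0))
```

Exhaustive search grows exponentially with the number of rows, which is one reason an ILP solver is the more practical choice for the real array.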
slide-19
SLIDE 19

An Implemented Real Chip “CCSOTB2”

  • CCSOTB2
      • VPCMA architecture
      • SOTB 65 nm technology
      • 5 body bias domains
  • Design: Verilog HDL
  • Synthesis: Synopsys Design Compiler
  • Place & Route: Synopsys IC Compiler

[Figure: chip photograph with dimensions labeled 6 mm and 3 mm, showing the TCI and the PE array.]

Body Bias Domains

  Domain     Covers
  domain1    1st-5th PE rows
  domain2    6th PE row
  domain3    7th PE row
  domain4    8th PE row
  domain5    Other parts
slide-20
SLIDE 20

Preliminary Experiments

  • Leak power of PE row is measured
      • BBV: -0.8 ~ +0.4 V (step: 0.2 V)
  • Maximum operating frequency
      • 30 MHz
      • Due to bottleneck in μ-controller

[Figure: experimental environment, showing the CCSOTB2 chip, Artix-7 FPGA, and mother board.]

[Figure: leak power measurement results, with zero bias marked.]

slide-21
SLIDE 21

Benchmark Applications

  • 4 simple image processing applications
  • Assuming 30 MHz frequency

  Name     Description
  af       24-bit alpha blender
  gray     24-bit gray scale
  sepia    8-bit sepia filter
  sf       24-bit sepia filter

slide-22
SLIDE 22

Proposed method vs. Black-Diamond

  • Black-Diamond [7]
      • Supports neither pipeline control nor body bias control
      • Static mapping regardless of user's requirements
  • Combine with pipeline optimization [6]
      • Considering glitch effects

[6] T. Kojima, et al. "Glitch-aware variable pipeline optimization for CGRAs". ReConFig2017, pp. 1–6, Dec 2017.
[7] V. Tunbunheng, et al. "Black-diamond: a retargetable compiler using graph with configuration bits for dynamically reconfigurable architectures". In Proc. of the 14th SASIMI, pp. 412–419, 2007.

slide-23
SLIDE 23

Mapping quality

[Figure: difference of mapping results for the gray application, comparing Black-Diamond with pipeline optimization against the proposed method. Voltage labels in the figure: 0.0 V, 0.0 V, 0.4 V, 0.4 V, 0.4 V.]
slide-24
SLIDE 24

Mapping quality

[Figure: difference of mapping results for the af application, comparing Black-Diamond with pipeline optimization against the proposed method. Voltage labels in the figure: 0.0 V, 0.0 V, 0.0 V, 0.0 V, 0.2 V.]
slide-25
SLIDE 25

Power reduction

  • For all applications, the total power is reduced
  • On average, a 14.2 % reduction is achieved

slide-26
SLIDE 26

Conclusion

  • A new optimization method based on a multi-objective genetic algorithm is proposed
  • Three controls are considered simultaneously
      • 1. Pipeline structure control
      • 2. Body bias control
      • 3. Application mapping
  • Real chip experiments show 14.2% power reduction
