
SLIDE 1

1 2007 2007 Pascal Vivet - CEA/LETI - PATMOS’2010, Grenoble, France

PATMOS 2010, Grenoble, France

20th Int. Workshop on Power And Timing Modeling, Optimization and Simulation

On line Power Optimization of Data Flow Multi-Core Architecture based on Vdd-Hopping for Local DVFS

Pascal Vivet (1), Edith Beigne (1), Hugo Lebreton (1), Nacer-Eddine Zergainoh (2)

(1) CEA-Leti, Minatec, Grenoble, France
(2) TIMA, Grenoble, France

{pascal.vivet@cea.fr}

SLIDE 2


Introduction

Power Consumption Challenge

With the convergence and growing capacity of mobile systems,
not all applications require the peak performance level :

Plenty of room for power optimization !

Power Consumption Issue

Must be addressed at all levels, from physical implementation to system level

Ex : dual-Vt, clock gating, power switches, DPM, DVFS, task mapping, task allocation, etc.

Main Power Consumption reduction techniques

Dynamic Power Management (DPM)
Dynamic Voltage and Frequency Scaling (DVFS)

$E \sim \frac{1}{2}\,\alpha\,C\,V^{2}\,N_{cycles}$

SLIDE 3


Context

Data Flow like application

Pre-determined computation flow, can be modeled as a task graph
Ex : video encoding/decoding, telecom baseband modulation, …

Heterogeneous Architecture

Composed of a mix of “Soft IPs” and “Hard IPs”
Interconnected in a data-flow manner, using a Network-on-Chip

Use an efficient on-chip DVFS technique

Using two voltage set points : “VDD-Hopping”

Objective :

Propose a hardware Local Power Manager
Ensure real-time constraints
On-line optimization, benefiting from dynamic slack time

SLIDE 4


Outline

Introduction & Context
Low Power GALS NoC architecture
Local Power Manager and DVFS control
Case study on a Telecom Application
Conclusion

SLIDE 5


Low Power GALS NoC Architecture

GALS scheme :

Independent synchronous islands
Interconnected by an asynchronous Network-on-Chip

Within each IP unit :

Local Clock Generator, provides the local core frequency
Power Supply Unit, provides the local core supply
Network Interface, handles NoC communications
Local Power Manager, controls low power mechanisms

A main CPU in charge of global power management

Task scheduling, DVFS parameters, …

Each IP unit is a fully independent frequency and power domain
Local fine-grain power management can be executed during IP computation and communication, independently in each unit

[E. Beigne & al, NOCS’08, JSSC’09]

SLIDE 6


VDD-Hopping for DVFS : Principle

Energy per operation scales with V²
In most applications, IPs do not need to be at full speed
Decrease Voltage and Frequency to be energy efficient
DVFS using two set points
Use of two PMOS power switches

Vhigh (1.2 V), Vlow (0.7 V), or off (0 V)

Can easily be integrated in any CMOS circuit
Initially proposed by Tokyo Univ. (2000)

But was not integrated on chip

Details in [S.Miermont, P.Vivet, M. Renaudin, PATMOS07]

Similar Recent Design [Bevan Baas, UC Davis]
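
As a rough worked example of the V² scaling, assuming the same switched capacitance and activity at both set points and ignoring leakage (assumptions not stated on the slide), the energy per operation at Vlow relative to Vhigh would be:

```latex
\frac{E_{low}}{E_{high}} \approx \left(\frac{V_{low}}{V_{high}}\right)^{2}
  = \left(\frac{0.7}{1.2}\right)^{2} \approx 0.34
```

Under these assumptions, an operation executed at Vlow costs roughly a third of the energy of the same operation at Vhigh.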

Hopping Operation and LPM control

[Figure: Vdd-Hopping between the (Vlow, Flow) and (Vhigh, Fhigh) set points gives an average (Vavg, Favg) operating point; axes: computing power, electrical power]

IP continues computation and communication during voltage transitions
Hopping transition < 50 ns
VDD-Hopping transitions are controlled by the LPM

[Figure: LPM control of the supply voltage (Vhigh, Vlow) and of the clock frequency (Fhigh, Flow)]

[S. Miermont & al, PATMOS’07]

SLIDE 7


Power modes and NI features

Power Modes :

Controlled by the Local Power Manager (LPM)

Power Mode : Behavior
High : NI and IP are active, at Vhigh supply
Low : NI and IP are active, at Vlow supply
Idle : NI only active, IP core clock is gated, at Vlow supply
Off : Supply is off (only leakage)

Network Interface features :

Handle NoC protocol

Packetisation, routing, flow control

Control IP hardware tasks

Configuration, Execution, Interrupt
(Wake up the LPM when an incoming task arrives, etc.)

How to control DVFS and VDD-Hopping from an application perspective ?
What kind of LPM control ?

SLIDE 8


Outline

Introduction & Context
Low Power GALS NoC architecture
Local Power Manager and DVFS control
Case study on a Telecom Application
Conclusion

SLIDE 9


Local DVFS control : main principles

Data Flow Heterogeneous architecture

The application is mapped and distributed on distinct IPs
The data flows are fully handled by hardware (Network Interface)

Data Flow Applications

With both latency and throughput constraints
Global latency control is required to meet the deadline

Propose to use Worst Case Execution Time (WCET)

On-line optimization makes it possible to benefit from data-dependent computation time

Trade off dynamic slack time against energy

⇒ Hybrid global and local optimization scheme

Off-line global scheduling

WCET is computed off line on application traces

On-line local control, to offer on-line optimization

Local Power Manager uses VDD-Hopping (two set points only) for DVFS

[D. Marculescu et al, ASP-DAC’2005] [A. Maxiaguine et al, CODESS’2005]

SLIDE 10


LPM : NI Synchronization

LPM synchronization with NI task execution

This is a very generic case : it does not depend on the IP internal computations or structure.
Various trade-offs in task granularity and hopping frequency, to smooth the traffic

Timeslot and WCEC constraints
Determine the number of cycles for High and Low voltage

[Figure: timing diagram showing configuration loaded, core active, and the Vhigh/Vlow supply over the timeslot]

$\tau = \frac{N_h}{f_h} + \frac{N_l}{f_l} = \frac{N_h}{f_h} + \frac{N_{wcec} - N_h}{f_l}$

$N_h = \frac{f_h}{f_h - f_l} \times \left( N_{wcec} - \tau f_l \right)$

$N_l = N_{wcec} - N_h$
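
The split between high-voltage and low-voltage cycles can be sketched in a few lines of Python. This is only an illustration of the timeslot constraint on this slide, not the LPM hardware: the function name `split_cycles`, the choice to round N_h up, and the clamping to [0, N_wcec] are assumptions added here. Frequencies and time must use consistent units (for example MHz and microseconds).

```python
import math


def split_cycles(n_wcec, tau, f_h, f_l):
    """Return (N_h, N_l) so that n_wcec cycles fit within timeslot tau.

    Hypothetical helper: solves tau = N_h/f_h + (N_wcec - N_h)/f_l for N_h.
    """
    if not 0 < f_l < f_h:
        raise ValueError("expected 0 < f_l < f_h")
    if n_wcec / f_h > tau:
        raise ValueError("timeslot cannot be met even when running only at f_h")
    # Real-valued solution of the timeslot equation.
    n_h_exact = f_h * (n_wcec - tau * f_l) / (f_h - f_l)
    # Assumption: round up (more fast cycles) so the deadline still holds,
    # then clamp to the valid range [0, n_wcec].
    n_h = min(n_wcec, max(0, math.ceil(n_h_exact)))
    return n_h, n_wcec - n_h
```

For instance, with 1000 worst-case cycles, a 3 µs timeslot, f_h = 500 MHz and f_l = 250 MHz, `split_cycles(1000, 3.0, 500.0, 250.0)` gives 500 cycles at each set point (1 µs at Vhigh plus 2 µs at Vlow).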

SLIDE 11


LPM : IP core synchronization

LPM synchronisation with the actual core computation

Define an atomic task as a sub-task whose number of cycles and number of inputs/outputs are known
Need additional logic to detect when the IP core really starts

Benefit of Vdd-Hopping properties

Compute the number of cycles Nh and Nl as before
Wait for incoming data at Vlow
Start the atomic task at Vlow, do a single transition to Vhigh, and go back to Vlow as soon as the atomic task computation is finished

[Figure: timing diagram showing task loaded, atomic task execution, and the Vhigh/Vlow supply]

SLIDE 12


LPM : On-Line optimization

When AEC < WCEC, dynamic slack time is available

Data-dependent delay, variable communication delay, …

On-line optimization to further reduce energy
Main principle : reallocate the remaining time of the current task to the next one

Next timeslot is incremented with unused cycles :

$\tau' = \tau + \frac{n_h}{f_h} + \frac{n_l}{f_l}$

Next task cycle numbers Nh and Nl are updated :

$N'_l = N_{wcec} - N'_h$, where $N'_h$ is $N_h$ reduced by a weighted sum of the unused cycles $n_h$ and $n_l$ (the weighting factor is not legible in this copy of the slide), under the condition $f_l < f_h \le 2 f_l$

[Figure: timeline of tasks k-1, k, k+1 over timeslots T, T, T', showing the Nh and Nl cycles at Vhigh/Vlow, the updated N'h and N'l, and the Idle and Compute phases]
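
The slack reallocation can be illustrated with a small Python sketch. It does not reproduce the slide's own update rule for N'h: instead, it extends the next timeslot by the unused time and then re-solves the timeslot constraint τ = N_h/f_h + N_l/f_l from the NI synchronization slide. The function name `next_task_budget`, the rounding and the clamping are assumptions made for illustration.

```python
import math


def next_task_budget(tau_next, n_h_unused, n_l_unused, n_wcec_next, f_h, f_l):
    """Return (tau_prime, N_h_prime, N_l_prime) for the next task.

    Hypothetical helper: n_h_unused and n_l_unused are the cycles the
    current task did not need at f_h and f_l (AEC < WCEC).
    """
    if not 0 < f_l < f_h:
        raise ValueError("expected 0 < f_l < f_h")
    # Convert the unused cycles into slack time and extend the next timeslot.
    slack = n_h_unused / f_h + n_l_unused / f_l
    tau_prime = tau_next + slack
    # Assumption: recompute N'_h exactly from the timeslot equation,
    # rounding up and clamping to [0, n_wcec_next].
    n_h_exact = f_h * (n_wcec_next - tau_prime * f_l) / (f_h - f_l)
    n_h_prime = min(n_wcec_next, max(0, math.ceil(n_h_exact)))
    return tau_prime, n_h_prime, n_wcec_next - n_h_prime
```

With a nominal 3 µs timeslot, 1000 worst-case cycles, f_h = 500 MHz, f_l = 250 MHz and 125 unused high-frequency cycles, `next_task_budget(3.0, 125, 0, 1000, 500.0, 250.0)` extends the timeslot to 3.25 µs and moves 125 cycles from Vhigh to Vlow (375 at Vhigh, 625 at Vlow).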

SLIDE 13


Outline

Introduction & Context
Low Power GALS NoC architecture
Local Power Manager and DVFS control
Case study on a Telecom Application
Conclusion

SLIDE 14


Case study : 3GPP LTE Telecom Application

3GPP LTE Application

MIMO scheme (2 antennas)
14 OFDM symbols

MAGALI circuit

15 NoC routers
Dedicated HW units (OFDM, bit interleaving, …)
Generic DSP (Mephisto cores)
Memory controllers (SME)
65nm technology, 30mm²

[Figure: 3GPP LTE application task graph. Blocks shown: symbol synchronization (Ant. 1, Ant. 2), OFDM demodulation, TTI buffer, frame synchronization, CFO estimation, CFO correction, channel estimation, MIMO decoding, soft demodulation, bit de-interleaving, depuncturing, turbo decoding]

[Figure: task mapping on the GALS NoC architecture. Units and mapped functions: mc8051_12 (MC8051), mep_10 (CFO, channel estimation), asip_24 (turbo decoding), mep_23 (MIMO decoding), mep_22 (MIMO decoding), sme_21 (SME), trx_ofdm_20 (OFDM demodulation), trx_ofdm_20s (OFDM demodulation), mep_21s (CFO, channel estimation), sme_22s (SME), rx_bit_23s (de-interleaving, demodulation), sme_10w (SME), NoC interfaces, ARM]

[F. Clermidy et al, ISSCC’2010]

SLIDE 15


Power Profiling Methodology

Accurate enough power profile is mandatory

« Handmade » analysis is not sufficient to validate/estimate DVFS schemes

Use an existing simulation platform of the architecture

SystemC/TLM model executing the application

Instrument the simulation platform :

Develop the Local Power Manager (LPM) module
Develop power models of the IPs, including DVFS/DPM features
Profile the simulation platform with power estimates

Power values extracted from gate level simulations

[Figure: iterative power estimation flow. Elements shown: application and embedded software, TL communications, interconnection TL model, IP models, power models (IP datasheet, design shrink, RTL extraction), logs, traces, power monitor, power data]

[H. Lebreton, P. Vivet, ISVLSI’08]

Use of a generic « TLM-Power » library

SLIDE 16


Power Optimization : Compared scenarios

Low : LOW mode at the maximum achievable Flow
High : HIGH mode at the maximum achievable Fhigh
On/Off : HIGH mode at maximum Fhigh, and IDLE when tasks complete
DFS : HIGH mode using only Dynamic Frequency Scaling
DVFS NI : DVFS synchronized with the NI
DVFS Core : DVFS synchronized with the core
DVFS AEC : DVFS synchronized with the core, plus on-line optimization using the Actual Execution Cycle

* IDLE mode = NI is active, IP core clock is gated, and Vlow supply is used (all scenarios use IDLE mode, except the first two)

SLIDE 17


Results (1/2)

Important power reduction for On/Off mode

Thanks to an efficient implementation of the Idle mode (core clock is gated, and using Vlow supply)

Almost no gain for DFS :

As expected, DFS only spreads transactions in time, but does not really save energy

The obtained power reduction clearly depends on IP profiles
Dedicated control brings some real benefits

DVFS Core : 35% of additional gains, compared to simple NI synchronization
DVFS AEC : up to a further 30% power reduction from on-line optimization, when delays are data dependent

[Figure: energy (mJ) per IP unit (asip_24, rx_bit_23s, mep_23, mep_22, mep_21s, mep_10, trx_ofdm_20s, trx_ofdm_20) for the scenarios Low, High, On/Off, DFS, DVFS NI, DVFS Core, DVFS AEC; energy axis from 0.0 to 3.0 mJ]

(*) The Low mode does not respect application constraints; it is given for indication only

SLIDE 18


Results (2/2)

Asynchronous Network-on-Chip power consumption

About 5% of the overall power consumption

Limited power reduction for memory controllers

Need specific control for these IPs, which depend strongly on the NoC dataflow

Overall results :

Application power budget reduced from 340 mW down to 160 mW
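
Taking the two reported power budgets at face value, the overall relative reduction works out as follows (a simple arithmetic check, not a figure quoted from the slides):

```latex
\frac{340\,\mathrm{mW} - 160\,\mathrm{mW}}{340\,\mathrm{mW}} = \frac{180}{340} \approx 0.53
```

That is, the application power budget is roughly halved.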

[Figure: energy (mJ) for NoC total, IP total and memories, for the scenarios Low, High, On/Off, DFS, DVFS NI, DVFS Core, DVFS AEC; energy axis from 0.0 to 5.0 mJ]

SLIDE 19


Outline

Introduction & Context
Low Power GALS NoC architecture
Local Power Manager and DVFS control
Case study on a Telecom Application
Conclusion

SLIDE 20


Conclusion

Local DVFS within a GALS NoC architecture

VDD-Hopping for DVFS, using only two set points (two external supplies)
Efficient design, with fast transition time, allowing distributed optimization

Hybrid Global and Local Optimization scheme

Off-line scheduling, using WCET analysis
On-line local control, to benefit from dynamic slack time
Simple idea : reallocate the remaining time of the current task to the next task

Validation on a complete application

Efficient Idle mode is mandatory (really low power and fast recovery)
Additional on-line control is beneficial, and can bring up to 45% more power reduction compared to Idle mode

SLIDE 21


Conclusion

Local Power Manager

A SystemC model is available
Evaluation of complexity : ~ 2 Kgates

Next steps :

Define a more sophisticated DVFS control scheme for Memory Controllers
Ex : buffer monitoring, using thresholds
Implement the Local Power Manager

Acknowledgments

This project has been partially funded by the ARAVIS Minalogic Project
