Energy Reduction via Critical Path Prediction Toshinori Sato - - PowerPoint PPT Presentation



SLIDE 1

WCED'02 The KIT COSMOS Processor 1

Energy Reduction via Critical Path Prediction

Toshinori Sato, Akihiro Chiyonobu, Itsujiro Arita
Kyushu Institute of Technology

SLIDE 2

Overview

  • Background
  • Power of CMOS circuits
  • Device and circuit design techniques
  • Architectural-level design techniques
  • Criticality-based instruction scheduling
  • Clustered microarchitecture

Summary

SLIDE 3

Nanometer design

Nanometer design poses many challenges for processor designers.

  • reliability, signal integrity, speed, power…

SLIDE 4

Applications

The growing popularity of mobile devices such as smart cell phones is a driving force behind research into high-performance and energy-efficient microprocessors.

  • 3D video games, flight ticket reservation, mobile banking, mobile trading, digital camera, MP3 player …

As the computing power of a mobile device increases, so does its power consumption.

  • Since mobile devices are battery-operated, energy efficiency is a first-class constraint for microprocessors.

SLIDE 5

Active power of CMOS circuit

$P_{active} = f \times C_{load} \times V_{dd}^{2}$

  • f: clock frequency
  • Cload: load capacitance
  • Vdd: supply voltage

Supply voltage reduction is the most effective way to lower power consumption, because active power scales with the square of Vdd.
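
As a rough illustration of the formula above, the following Python sketch compares two operating points. The operating points (1.0 GHz at 1.6 V and 500 MHz at 1.15 V) are the ones this deck gives later for the fast and slow units; the function name and the assumption that both units see the same load capacitance are ours, not the authors'.

```python
def active_power(f_hz, c_load, vdd):
    """Active power of a CMOS circuit: P = f * C_load * Vdd^2."""
    return f_hz * c_load * vdd ** 2


# Assumption: identical load capacitance for both units, in arbitrary
# units. It cancels out of every ratio computed below.
C_LOAD = 1.0

fast = active_power(1.0e9, C_LOAD, 1.6)    # fast unit operating point
slow = active_power(500e6, C_LOAD, 1.15)   # slow unit operating point

# Power is energy per second; dividing by frequency gives energy per cycle.
power_ratio = slow / fast
energy_per_cycle_ratio = (slow / 500e6) / (fast / 1.0e9)

print(f"slow/fast power ratio:            {power_ratio:.3f}")
print(f"slow/fast energy-per-cycle ratio: {energy_per_cycle_ratio:.3f}")
```

Under these assumptions the slow operating point draws roughly a quarter of the power and about half of the energy per cycle. The frequency term drops out of the energy ratio, which leaves only the quadratic dependence on Vdd.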

SLIDE 6

Gate delay of CMOS circuit

$T_{pd} \propto \frac{V_{dd}}{(V_{dd} - V_{th})^{\alpha}}$

  • Vdd: supply voltage
  • Vth: threshold voltage of device
  • α: factor depending upon carrier velocity saturation

Supply voltage reduction increases gate delay, which results in slower clock frequency.
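
To make the delay relation concrete, here is a minimal Python sketch that evaluates it at a few supply voltages. The threshold voltage (0.4 V) and α (1.3) are illustrative assumptions only; the slides do not state the values used by the authors. Delays are reported relative to 1.6 V because the relation is a proportionality, not an absolute formula.

```python
def relative_gate_delay(vdd, vth, alpha):
    """Gate delay up to a constant factor: Vdd / (Vdd - Vth)^alpha."""
    if vdd <= vth:
        raise ValueError("Vdd must exceed Vth for the device to switch")
    return vdd / (vdd - vth) ** alpha


VTH = 0.4    # assumed threshold voltage in volts (not from the slides)
ALPHA = 1.3  # assumed velocity-saturation factor (not from the slides)

nominal = relative_gate_delay(1.6, VTH, ALPHA)
for vdd in (1.6, 1.4, 1.2, 1.0):
    slowdown = relative_gate_delay(vdd, VTH, ALPHA) / nominal
    print(f"Vdd = {vdd:.1f} V -> delay x{slowdown:.2f} relative to 1.6 V")
```

The slowdown grows faster as Vdd approaches Vth, which is why voltage scaling trades speed for power.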

SLIDE 7

Device & circuit optimizations

We can exploit critical path information.

Critical path (CP) is the path that determines processor cycle time (frequency).

  • Small transistors (tr) on non-CP
  • Low supply voltage for transistors on non-CP
  • Large-threshold transistors on non-CP

  • Multiple supply voltages (on-chip supplies)
  • Multiple-threshold CMOS, variable-threshold CMOS

Background

SLIDE 8

Architectural-level techniques

We can select functional units based on each instruction’s criticality.

High-speed and power-hungry units
  • e.g., CSA, CLA

Low-speed and power-efficient units
  • e.g., RCA

Criticality-based instruction scheduling

SLIDE 9

Criticality-based scheduling

Dispatch policy is based on each instruction’s criticality.

  • Only instructions on critical paths should be dispatched into fast functional units.
  • Non-critical instructions can use slow functional units.
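
The dispatch policy on this slide can be sketched as a small selection function. This is an illustrative Python sketch, not the authors' implementation: the function name, the free-unit counters, and the optional `allow_fallback` flag are hypothetical. By default it follows the slide strictly; the fallback to the other unit type when the preferred one is busy is an assumption added for illustration.

```python
def choose_unit(is_critical, free_fast, free_slow, allow_fallback=False):
    """Pick a functional unit type for one instruction.

    Returns "fast", "slow", or None when the instruction must wait.
    Critical instructions prefer fast units, non-critical ones prefer
    slow units. With allow_fallback=True (an assumption, not stated on
    the slide) an instruction may use the other type if its preferred
    type has no free unit.
    """
    preferred, other = ("fast", "slow") if is_critical else ("slow", "fast")
    free = {"fast": free_fast, "slow": free_slow}
    if free[preferred] > 0:
        return preferred
    if allow_fallback and free[other] > 0:
        return other
    return None  # wait for a unit of the preferred type


print(choose_unit(True, free_fast=3, free_slow=3))    # critical instruction
print(choose_unit(False, free_fast=3, free_slow=3))   # non-critical instruction
```

Keeping the fallback behind a flag makes it easy to compare the strict policy against a more permissive one in a simulator.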

SLIDE 10

Critical path

The chain of dependent instructions that determines the number of cycles needed to execute a program.

[Figure: dependence graph of instructions I0 to I9 with the critical path highlighted]
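
One way to make this definition concrete is to compute the longest latency-weighted chain in a data dependence graph. The Python sketch below does this for a small hypothetical graph; the instruction names, latencies, and edges are invented for illustration and are not the graph from the slide. It also ignores resource limits, so it captures only the dataflow view of criticality.

```python
def critical_path(latency, deps):
    """Longest latency-weighted chain in a dependence DAG.

    latency: dict mapping instruction -> execution cycles, listed in
             program order so producers come before their consumers.
    deps:    dict mapping instruction -> list of producer instructions.
    Returns (total_cycles, list of instructions on the critical path).
    """
    finish = {}  # earliest finish time of each instruction
    parent = {}  # the producer that finished last, or None
    for insn in latency:
        producers = deps.get(insn, [])
        last = max(producers, key=lambda p: finish[p], default=None)
        start = finish[last] if last is not None else 0
        finish[insn] = start + latency[insn]
        parent[insn] = last

    # Walk back from the instruction that finishes last.
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return total, path[::-1]


# Hypothetical example: I1 is a 3-cycle operation, the rest take 1 cycle.
latency = {"I0": 1, "I1": 3, "I2": 1, "I3": 1, "I4": 1}
deps = {"I2": ["I0", "I1"], "I3": ["I2"], "I4": ["I1"]}

total, path = critical_path(latency, deps)
print(total, path)
```

In this example I0 and I4 have slack, so under the criticality-based policy they would be candidates for slow functional units.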

SLIDE 11

Critical path prediction

Tune’s critical path prediction buffer [HPCA’01]
  • Simple but uses only local information

Fields’s token-passing CP predictor [ISCA’01]
  • More accurate due to use of global information, but complex

SLIDE 12

Tune’s CPP buffer

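[Figure: the PC indexes a table of counters; an instruction is predicted critical if its counter > Th, and not critical otherwise]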

SLIDE 13

Criticality-based scheduling

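[Figure: each instruction is steered by its predicted criticality; critical instructions go to fast and power-hungry units, non-critical instructions go to slow and power-efficient units]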

SLIDE 14

Evaluation

  • OOO 8-way superscalar processor
  • Based on SimpleScalar/Alpha tool set
  • 6 integer units (fast / slow)
  • 4K-entry CPP buffer, 3-bit counters: +1 if critical, -1 if not. Threshold = 5.
  • SPEC2000 benchmark
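
The CPP buffer configuration listed above can be sketched as a table of saturating counters. This Python sketch uses the slide's parameters (4K entries, 3-bit counters, +1/-1 updates, threshold 5). The class name, the index function (PC shifted by 2, assuming 4-byte Alpha instructions), the zero initial counter value, and the strict "greater than" comparison (read off the "> Th?" label on the CPP buffer slide) are assumptions.

```python
class CPPBuffer:
    """Sketch of a PC-indexed critical path prediction buffer."""

    def __init__(self, entries=4096, bits=3, threshold=5):
        self.entries = entries
        self.max_count = (1 << bits) - 1   # 3-bit counter saturates at 7
        self.threshold = threshold
        self.counters = [0] * entries      # assumed initial value: 0

    def _index(self, pc):
        # Assumption: drop the 2 low bits of a 4-byte-aligned PC, then
        # wrap around the table size.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Predict critical when the counter exceeds the threshold."""
        return self.counters[self._index(pc)] > self.threshold

    def update(self, pc, was_critical):
        """+1 if the instruction was critical, -1 if not, saturating."""
        i = self._index(pc)
        if was_critical:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


buf = CPPBuffer()
pc = 0x1000
for _ in range(6):
    buf.update(pc, was_critical=True)
print(buf.predict(pc))
```

Under the strict comparison assumed here, an instruction starting from a zeroed counter must be observed critical six times before it is predicted critical.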

SLIDE 15

Processor models

[Figure: baseline model and power-efficient model]

SLIDE 16

Vdd and frequency scaling

            Slow units   Fast units
Frequency   500MHz       1.0GHz
Vdd         1.15V        1.6V

SLIDE 17

%Increase in cycles

[Figure: two bar charts (pipelined, non-pipelined) of % increase in cycles for parser, eon, vortex, and bzip2; legend: 3fast/3slow/4K; y-axis 0 to 25]

SLIDE 18

%Distribution of dispatch

[Figure: two charts (pipelined, non-pipelined) of % distribution of dispatch for parser, eon, vortex, and bzip2; legend: NS, CS, NF, CF; y-axis 0% to 100%]

SLIDE 19

%Energy reduction in FU

[Figure: two bar charts of % energy reduction in FUs for parser, eon, vortex, and bzip2; series: pipelined, non-pipelined; y-axis 0 to 45]

SLIDE 20

Clustered microarchitecture

To further reduce power, we split the instruction queue into a fast queue and a slow queue.

  • Fast cluster consists of the fast queue and fast functional units.
  • Slow cluster consists of the slow queue and slow functional units.

The two clusters are connected by small FIFOs, if necessary.
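
The split queue can be sketched as two bounded queues with criticality-based steering. This Python sketch is illustrative only: the class and method names are hypothetical, the default sizes (16 fast, 48 slow) are the ones evaluated later in this deck, and stalling when the target queue is full is an assumption. The inter-cluster FIFOs are not modeled.

```python
from collections import deque


class ClusteredQueues:
    """Sketch of an instruction queue split into fast and slow clusters."""

    def __init__(self, fast_entries=16, slow_entries=48):
        self.capacity = {"fast": fast_entries, "slow": slow_entries}
        self.queues = {"fast": deque(), "slow": deque()}

    def steer(self, insn, is_critical):
        """Insert insn into the fast or slow queue.

        Returns the chosen cluster name, or None if that queue is full
        (assumption: the front end stalls in that case).
        """
        cluster = "fast" if is_critical else "slow"
        if len(self.queues[cluster]) >= self.capacity[cluster]:
            return None
        self.queues[cluster].append(insn)
        return cluster


q = ClusteredQueues()
print(q.steer("add r1, r2, r3", is_critical=True))
print(q.steer("ldq r4, 0(r5)", is_critical=False))
```

Because each cluster only wakes up and selects among its own entries, the smaller fast queue is where the latency-sensitive work concentrates.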

SLIDE 21

Clustered datapath

SLIDE 22

Inter-cluster bypassing

SLIDE 23

Evaluation

  • 16-entry fast and 48-entry slow queues
  • A dispatched instruction does not release its queue entry, just like the RUU.
  • 36.3% power reduction in the queues.

SLIDE 24

Processor models

[Figure: non-clustered model and clustered model]

SLIDE 25

%Increase in cycles

[Figure: two bar charts (pipelined, non-pipelined) of % increase in cycles for parser, eon, vortex, and bzip2; legend: 16fastQ/48slowQ; y-axis 0 to 80]

SLIDE 26

%Distribution of dispatch

[Figure: two charts (pipelined, non-pipelined) of % distribution of dispatch for parser, eon, vortex, and bzip2; legend: NS, CF; y-axis 0% to 100%]

SLIDE 27

%Energy reduction in FU

[Figure: bar chart of % energy reduction in FUs for parser, eon, vortex, and bzip2; series: pipelined, non-pipelined; y-axis -40 to 30]

SLIDE 28

Summary

  • The tradeoff between power and performance can be carefully investigated by exploiting critical path information in architectural-level design as well as in device and circuit design.
  • We evaluated criticality-based scheduling. It reduces energy in FUs by over 30%.
  • We also evaluated a clustered microarchitecture. Currently, it is not always a good design choice for energy reduction.