WCED'02 The KIT COSMOS Processor 1
Energy Reduction via Critical Path Prediction
Toshinori Sato, Akihiro Chiyonobu, Itsujiro Arita
Kyushu Institute of Technology
Overview
Background
- Power of CMOS circuits
- Device and circuit design techniques
- Architectural-level design techniques
Criticality-based instruction scheduling
Clustered microarchitecture
Summary
Nanometer design
Nanometer design poses many challenges for processor designers.
- reliability, signal integrity, speed, power…
Applications
The growing popularity of mobile devices such as smart cell phones is a driving force behind research into high-performance, energy-efficient microprocessors.
- 3D video games, flight ticket reservation, mobile banking, mobile trading, digital camera, MP3 player …
As the computing power of a mobile device increases, so does its power consumption.
- Since mobile devices are battery-operated, energy efficiency is a first-class constraint for microprocessors.
Active power of CMOS circuit
$P_{active} = f \times C_{load} \times V_{dd}^2$
- $f$: clock frequency
- $C_{load}$: load capacitance
- $V_{dd}$: supply voltage
Supply voltage reduction is the most effective way to lower power consumption.
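As a quick numerical check of the quadratic dependence on supply voltage, the formula can be evaluated directly. This is a minimal sketch: the function name and the sample operating point are illustrative assumptions, not figures from the talk.

```python
def active_power(f_hz: float, c_load_farads: float, vdd_volts: float) -> float:
    """Active power of a CMOS circuit: P = f * C_load * Vdd^2."""
    return f_hz * c_load_farads * vdd_volts ** 2


# Illustrative (assumed) operating point: 1 GHz, 1 nF switched capacitance, 1.6 V.
base = active_power(1.0e9, 1.0e-9, 1.6)

# Halving Vdd alone scales power by 1/4, whereas halving f alone
# (or C_load alone) only scales it by 1/2.
half_vdd = active_power(1.0e9, 1.0e-9, 0.8)
half_f = active_power(0.5e9, 1.0e-9, 1.6)

print(f"half Vdd: {half_vdd / base:.2f} of baseline")
print(f"half f:   {half_f / base:.2f} of baseline")
```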
Gate delay of CMOS circuit
$T_{pd} \propto \frac{V_{dd}}{(V_{dd} - V_{th})^{\alpha}}$
- $V_{dd}$: supply voltage
- $V_{th}$: threshold voltage of device
- $\alpha$: factor depending upon carrier velocity saturation
Supply voltage reduction increases gate delay, which results in a lower clock frequency.
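The delay relation can be explored numerically to see how quickly gates slow down as the supply voltage approaches the threshold voltage. The sketch below evaluates the relation only up to its constant of proportionality; the threshold voltage and the exponent are assumed values chosen for illustration, not parameters from the talk.

```python
def relative_gate_delay(vdd: float, vth: float, alpha: float) -> float:
    """Gate delay up to a constant factor: Vdd / (Vdd - Vth)^alpha."""
    if vdd <= vth:
        raise ValueError("Vdd must exceed Vth for the gate to switch")
    return vdd / (vdd - vth) ** alpha


# Assumed device parameters, chosen only for illustration.
VTH = 0.35    # threshold voltage in volts (assumption)
ALPHA = 1.3   # velocity-saturation factor (assumption)

reference = relative_gate_delay(1.6, VTH, ALPHA)
for vdd in (1.6, 1.4, 1.2, 1.0):
    slowdown = relative_gate_delay(vdd, VTH, ALPHA) / reference
    print(f"Vdd = {vdd:.1f} V: delay is {slowdown:.2f}x the 1.6 V delay")
```

With these assumed parameters the delay grows monotonically as the supply voltage drops, which is the tradeoff the rest of the talk tries to confine to non-critical work.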
Device & circuit optimizations
We can exploit critical path information.
- The critical path (CP) is the path that determines processor cycle time (frequency).
- Small transistors on non-CP
- Low supply voltage for transistors on non-CP
- Large-threshold transistors on non-CP
- Multiple supply voltages (on-chip supplies)
- Multiple-threshold CMOS, variable-threshold CMOS
Background
Architectural-level techniques
We can select functional units based on each instruction’s criticality.
High-speed and power-hungry units
- e.g., CSA, CLA
Low-speed and power-efficient units
- e.g., RCA
Criticality-based instruction scheduling
Criticality-based scheduling
Dispatch policy is based on each instruction’s criticality.
- Only instructions on critical paths should be dispatched into fast functional units.
- Non-critical instructions can use slow functional units.
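The dispatch policy can be sketched as a small unit-selection routine. This is an illustration under stated assumptions, not the authors' implementation: the `FunctionalUnit` type, the `select_unit` name, and in particular the fallback to any free unit when the preferred kind is busy are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FunctionalUnit:
    name: str
    fast: bool          # True for a fast, power-hungry unit
    busy: bool = False


def select_unit(is_critical: bool,
                units: List[FunctionalUnit]) -> Optional[FunctionalUnit]:
    """Pick a free unit for one instruction, or None if it must wait."""
    free = [u for u in units if not u.busy]
    # Critical instructions prefer fast units; non-critical prefer slow ones.
    preferred = [u for u in free if u.fast == is_critical]
    if preferred:
        return preferred[0]
    # Assumed fallback: take any free unit rather than stall.
    return free[0] if free else None


units = [FunctionalUnit("fast0", fast=True), FunctionalUnit("slow0", fast=False)]
print(select_unit(True, units).name)    # critical instruction
print(select_unit(False, units).name)   # non-critical instruction
```

Whether a critical instruction should fall back to a slow unit or wait for a fast one is a policy choice with a performance and energy cost on either side; the sketch picks the fallback only to keep the example short.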
Critical path
The chain of dependent instructions that determines the number of cycles needed to execute a program.
[Figure: dependence graph of instructions I0 to I9 with the critical path marked]
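To make the definition concrete, the critical path of a small dependence graph can be found as the longest chain through it. The sketch below considers only data dependences and fixed latencies and ignores resource limits, which is a simplifying assumption; the five-instruction graph and its latencies are hypothetical and are not the graph from the slide.

```python
from typing import Dict, List, Tuple


def critical_path(latency: Dict[str, int],
                  deps: Dict[str, List[str]]) -> Tuple[int, List[str]]:
    """Return (length in cycles, chain) of the longest dependence chain.

    `latency` must list instructions in program order, so every producer
    appears before its consumers.
    """
    finish: Dict[str, int] = {}   # earliest finish time of each instruction
    prev: Dict[str, str] = {}     # last-finishing producer of each instruction
    for inst, lat in latency.items():
        start = 0
        for producer in deps.get(inst, []):
            if finish[producer] > start:
                start = finish[producer]
                prev[inst] = producer
        finish[inst] = start + lat

    # Walk back from the instruction that finishes last.
    last = max(finish, key=finish.get)
    chain = [last]
    while chain[-1] in prev:
        chain.append(prev[chain[-1]])
    return finish[last], chain[::-1]


# Hypothetical example: I2 is a 3-cycle operation, the rest take 1 cycle.
latency = {"I0": 1, "I1": 1, "I2": 3, "I3": 1, "I4": 1}
deps = {"I2": ["I0"], "I3": ["I1"], "I4": ["I2", "I3"]}
length, chain = critical_path(latency, deps)
print(length, chain)
```

In this example I1 and I3 are off the critical path, so running them on a slower unit would not lengthen execution as long as their chain still finishes before I2 does.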
Critical path prediction
Tune’s critical path prediction buffer [HPCA’01]
- Simple but uses only local information
Fields’s token-passing CP predictor [ISCA’01]
- More accurate due to use of global information, but complex
Tune’s CPP buffer
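[Figure: table indexed by PC; each entry holds a counter, and the instruction is predicted critical if the counter > Th, not critical otherwise]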
Criticality-based scheduling
[Figure: critical instructions go to fast, power-hungry units; non-critical instructions go to slow, power-efficient units]
Evaluation
Out-of-order (OOO) 8-way superscalar processor
Based on SimpleScalar/Alpha tool set
6 integer units (fast / slow)
4K-entry CPP buffer, 3-bit counters
- +1 if critical, -1 if not. Threshold = 5.
SPEC2000 benchmark
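The CPP buffer configuration above (4K entries, 3-bit counters, +1 / -1 updates, threshold 5) can be sketched as a table of saturating counters. This is a minimal illustration, not the evaluated simulator code: the class and method names, the PC-to-index mapping, the initial counter value, and the strict `>` comparison against the threshold are assumptions. How an instruction is judged critical when the counter is trained is outside this sketch and is passed in as `was_critical`.

```python
class CPPBuffer:
    """PC-indexed table of saturating counters for criticality prediction."""

    def __init__(self, entries: int = 4096, counter_bits: int = 3,
                 threshold: int = 5):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1   # 7 for 3-bit counters
        self.threshold = threshold
        self.counters = [0] * entries              # assumed initial value: 0

    def _index(self, pc: int) -> int:
        # Assumed indexing: drop the 2 byte-offset bits of a 4-byte
        # instruction address, then keep the low-order bits.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        """Predict critical when the counter exceeds the threshold."""
        return self.counters[self._index(pc)] > self.threshold

    def update(self, pc: int, was_critical: bool) -> None:
        """+1 if the instruction was critical, -1 if not, saturating."""
        i = self._index(pc)
        if was_critical:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


buf = CPPBuffer()
pc = 0x120001000                 # hypothetical instruction address
for _ in range(6):
    buf.update(pc, was_critical=True)
print(buf.predict(pc))
```

Because the table is indexed by a subset of PC bits and carries no tags in this sketch, two instructions can share a counter; that aliasing is a property of the simplification, not a claim about the original design.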
Processor models
[Figure: baseline model and power-efficient model]
Vdd and frequency scaling
            Slow units   Fast units
Vdd         1.15 V       1.6 V
Frequency   500 MHz      1.0 GHz
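Combining these operating points with the active power formula gives a rough estimate of what a slow unit saves. This is a back-of-the-envelope calculation that assumes, hypothetically, that fast and slow units have the same load capacitance $C_{load}$; the talk does not state this. Energy per cycle is power divided by frequency, so $E = P_{active} / f = C_{load} V_{dd}^2$.

```latex
\[
\frac{E_{slow}}{E_{fast}}
  = \frac{C_{load} \times 1.15^2}{C_{load} \times 1.6^2}
  = \left(\frac{1.15}{1.6}\right)^2
  \approx 0.52
\]
\[
\frac{P_{slow}}{P_{fast}}
  = \frac{500\,\mathrm{MHz}}{1.0\,\mathrm{GHz}} \times \left(\frac{1.15}{1.6}\right)^2
  \approx 0.5 \times 0.52
  \approx 0.26
\]
```

Under that assumption a slow unit spends roughly half the energy per cycle of a fast unit and about a quarter of the power, at the cost of running at half the clock frequency.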
%Increase in cycles
[Figure: % increase in cycles for parser, eon, vortex, and bzip2; panels: pipelined and non-pipelined; series: 3 fast / 3 slow / 4K]
%Distribution of dispatch
[Figure: % distribution of dispatch for parser, eon, vortex, and bzip2; panels: pipelined and non-pipelined; legend: NS, CS, NF, CF]
%Energy reduction in FU
[Figure: % energy reduction in functional units for parser, eon, vortex, and bzip2; series: pipelined and non-pipelined]
Clustered microarchitecture
To further reduce power, we split the instruction queue into a fast queue and a slow queue.
- The fast cluster consists of the fast queue and the fast functional units.
- The slow cluster consists of the slow queue and the slow functional units.
- The two clusters are connected by small FIFOs, if necessary.
Clustered datapath
Inter-cluster bypassing
Evaluation
16-entry fast queue and 48-entry slow queue
- A dispatched instruction does not release its queue entry, just as in the RUU.
36.3% power reduction in the queues.
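The split queue can be sketched as two fixed-capacity queues with criticality-based steering. This is an illustration only: the class and method names are hypothetical, stalling when the target queue is full is an assumed policy, and freeing entries at commit is an assumption based on the RUU comparison above. The inter-cluster FIFOs are not modeled.

```python
from collections import deque
from typing import Optional


class ClusteredQueues:
    """Fast and slow instruction queues with fixed capacities."""

    def __init__(self, fast_entries: int = 16, slow_entries: int = 48):
        self.capacity = {"fast": fast_entries, "slow": slow_entries}
        self.queues = {"fast": deque(), "slow": deque()}

    def steer(self, inst: str, is_critical: bool) -> Optional[str]:
        """Place inst in its cluster's queue and return the cluster name,
        or None if that queue is full (assumed behavior: stall)."""
        cluster = "fast" if is_critical else "slow"
        if len(self.queues[cluster]) >= self.capacity[cluster]:
            return None
        self.queues[cluster].append(inst)
        return cluster

    def commit(self, cluster: str) -> str:
        """Free the oldest entry of a cluster. Entries stay allocated
        after dispatch and are only freed here (assumption)."""
        return self.queues[cluster].popleft()


q = ClusteredQueues()
print(q.steer("I0", is_critical=True))
print(q.steer("I1", is_critical=False))
```

Since entries are held after dispatch, the small fast queue can fill up with in-flight critical instructions, which is one plausible reason cycle counts rise in the clustered model.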
Processor models
[Figure: non-clustered model and clustered model]
%Increase in cycles
[Figure: % increase in cycles for parser, eon, vortex, and bzip2; panels: pipelined and non-pipelined; series: 16 fast Q / 48 slow Q]
%Distribution of dispatch
[Figure: % distribution of dispatch for parser, eon, vortex, and bzip2; panels: pipelined and non-pipelined; legend: NS, CF]
%Energy reduction in FU
[Figure: % energy reduction in functional units for parser, eon, vortex, and bzip2; series: pipelined and non-pipelined]
Summary
The tradeoff between power and performance can be carefully investigated by exploiting critical path information in architectural-level design as well as in device and circuit design.
We evaluated criticality-based scheduling. It reduces energy in FUs by over 30%.
We also evaluated a clustered microarchitecture. Currently, it is not always a