Low Power SOC Design and Automation, Matt Severson, Qualcomm CDMA Technologies



slide-1
SLIDE 1

Matt Severson Qualcomm CDMA Technologies July 27, 2009

Low Power SOC Design and Automation

slide-2
SLIDE 2

Outline

  • Introduction
  • Overview of Serra (Qualcomm’s first 45nm tapeout)
  • Features / Technology / Low Power Techniques
  • Tradeoffs of Automated vs. Custom Design for Low Power
  • Memory IP
  • Standard Cell
  • Mixed-Vt Design
  • Clock Power
  • Clock Gating
  • Clock Tree Synthesis
  • Multiple Power Domains
  • Voltage Scaling
  • Voltage Islands with Power Gating
  • Conclusions & Future Directions

slide-3
SLIDE 3

Introduction

  • Power consumption is a key differentiator in wireless communications products.
  • Power Consumption Limits
  • Battery Life
  • Performance
  • Feature set
  • Form Factor

slide-4
SLIDE 4

Introduction – Form Factor

Figure: Phone Surface Temperature Rise Above Ambient. Temperature Rise [C] (1 to 100) versus Surface Power Density [W/sq-in] (0.01 to 1), both on log scales.

  • Surface Power Densities less than 0.1 W/sq-in: This is the recommended design area
  • Surface Power Densities between 0.1 and 0.22 W/sq-in: Phone is likely to have local hot spots
  • Surface Power Densities greater than 0.22 W/sq-in: Phone should be redesigned

  • Power Densities Increasing
  • Overheating
  • Limit Form Factors
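
The three surface power density regions on this slide can be written as a small classifier. This is only a sketch: the function names are hypothetical, and the handling of the exact boundary values (0.1 and 0.22 W/sq-in) is an assumption, since the slide does not say which region they fall in.

```python
def surface_power_density(power_w: float, area_sq_in: float) -> float:
    """Average surface power density in W/sq-in."""
    if area_sq_in <= 0:
        raise ValueError("area must be positive")
    return power_w / area_sq_in


def classify_power_density(w_per_sq_in: float) -> str:
    """Map a surface power density to the slide's three regions.

    Assumption: exactly 0.1 and exactly 0.22 W/sq-in are treated as
    part of the middle ("hot spots") region.
    """
    if w_per_sq_in < 0:
        raise ValueError("power density must be non-negative")
    if w_per_sq_in < 0.1:
        return "recommended design area"
    if w_per_sq_in <= 0.22:
        return "likely local hot spots"
    return "phone should be redesigned"
```

For instance, a hypothetical phone dissipating 2 W over 15 sq-in has a density of about 0.13 W/sq-in and lands in the "likely local hot spots" region.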
slide-5
SLIDE 5

Introduction - Battery Life

Figure: Battery Life Analysis. Review ratings (1 Star (Bad) to 5 Stars (Good)) as % of Total Reviews for Device 1 to Device 10 on Verizon™ and AT&T™, annotated with each phone's chipset, battery capacity (800 mAh to 1500 mAh) and screen pixels (176 x 220 to 360 x 480).

Expressed Dissatisfaction labels for the ten devices: 44%, 43%, 21%, 28%, 14%, 26%, 35%, 12%, 19%, 48%.

slide-6
SLIDE 6

SERRA

Qualcomm’s First 45nm Tapeout


slide-7
SLIDE 7

Serra Feature Set


  • Modem
  • CDMA 1xEV-DO revA & B
  • UMTS (includes HSDPA, HSUPA)
  • GSM (includes GPRS and EDGE)
  • Unified GPS engine for both CDMA and UMTS modes.

  • Processors
  • QDSP4u8 based MDSP core
  • ARM11 core with 32KB I/D cache
  • ARM926 core with 32KB I/D cache
  • QDSP5u4 based ADSP core w/ 256 KB L2 cache

  • Multimedia
  • 24 bit WVGA w/ LCDC (active refresh)
  • ATI LT graphics core (Open GL 2.0)
  • 22M Triangles/Sec
  • 8 Mpixel Camera support
  • Peripherals
  • 2 HS USB interfaces
  • MDDI gen 1.5
slide-8
SLIDE 8

Serra Physical Characteristics

  • Die size: 8200.08 x 6500.34 um (53.3 mm^2)
  • Signal I/Os: 419
  • Process: tsmc45lp
  • Metal Layers: 6 (5 thin, 1 thick (4x)) and 1 AP RDL layer
  • Total # Transistors: 170 Million
  • Total # RAM bits: 13.7 Mbits
  • Total # ROM bits: 1.1 Mbits
  • Static IR Drop: < 10mV (@ Worst case 800 mA)
  • Leakage: ~450 uA (TT, 25C, 1.125V)
  • Package: 671 pin 13x13 NSP, 0.5mm ball pitch

Includes Serra (Digital die) + Analog + Memory

slide-9
SLIDE 9

Serra Low Power Design Goals

Background

  • Leakage Power is increasing due to process
  • 45nm Sub-Threshold is worse than 65nm (pA/um)
  • Gate leakage is increasing.
  • Junction Diode leakage is increasing.
  • 45nm Process has no HVt transistor.
  • Simple scaling of Dynamic Power is not Enough.
  • + Dynamic Power will scale down with process geometries (-C)(-V)
  • However increased wire cap will temper the reduction
  • Increased performance demands and more applications (+f) (+C)
  • Aggressive Product requirements for battery life
  • Conclusion:
  • More aggressive leakage and active power management techniques are required in 45nm

  • Low Power Priorities / Goals for Serra:
  • 1 Decrease Dynamic Power
  • 2 Maintain the total static leakage power.
  • 3 Keep Active Leakage a “small” percentage of Dynamic Power (< ~15%)
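
The (-C)(-V)(+f) annotations above refer to the terms of the standard first-order dynamic power relation, P_dyn = a · C · V² · f (activity, switched capacitance, supply voltage, frequency). The sketch below uses that textbook relation to show how a node shrink and a frequency increase can offset each other, and checks goal 3 (active leakage below about 15% of dynamic power). The capacitance and frequency ratios in the usage note are made-up illustration values, not Serra data; only the 1.23 V and 1.125 V supplies come from the block-level comparison later in the deck.

```python
def relative_dynamic_power(c_ratio: float, v_ratio: float, f_ratio: float,
                           activity_ratio: float = 1.0) -> float:
    """New/old dynamic power ratio using P_dyn = a * C * V^2 * f."""
    return activity_ratio * c_ratio * v_ratio ** 2 * f_ratio


def active_leakage_within_goal(leakage_mw: float, dynamic_mw: float,
                               limit: float = 0.15) -> bool:
    """True if active leakage stays under `limit` of dynamic power.

    The 15% default mirrors the "< ~15%" goal on the slide.
    """
    return leakage_mw < limit * dynamic_mw
```

With a hypothetical 20% capacitance reduction, the supply going from 1.23 V to 1.125 V, and a hypothetical 30% higher clock, `relative_dynamic_power(0.8, 1.125 / 1.23, 1.3)` comes out near 0.87: most of the process gain is consumed by the higher frequency, which is the point the slide makes.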


slide-10
SLIDE 10

Serra Low Power Features

  • Low Power Multi-Threshold Qualcomm Standard Cell Library
  • 2 Vt and 2 Channel Lengths
  • Low Power Memory
  • Power Collapsing of RAM/ROM periphery and core
  • Independent Bank Collapsing for Large High Density Memories
  • Advanced Low Power Clocking
  • ~105 Master Clock Domains (I/O or Independent Frequency)
  • ~230 Total Clock Domains (Synchronous, Iso-Synchronous, Asynchronous)
  • Automatically inserted Fine grained clock gating
  • Manually inserted Architectural clock gates
  • Static SW control and Dynamic HW control of clock gating.
  • Custom Raw Clock Tree Routing
  • Low Power CTS with Qualcomm Custom Clock Tree cells.
  • 24 Analog and Pad power domains
  • 2 Digital Power domains
  • Independent Voltage Scaling

– Active and Sleep modes

  • Power Collapsing
  • 8 Digital Power Islands with Power Gating
  • All Low Power Features fully Verified
  • Power Aware simulation
  • Power Structural Checks


slide-11
SLIDE 11

Serra Floorplan


slide-12
SLIDE 12

Serra Static IR Drop Map


slide-13
SLIDE 13

Serra Dynamic IR drop Map


slide-14
SLIDE 14

DESIGN AUTOMATION

slide-15
SLIDE 15

Design Automation

  • Design Automation is Mandatory
  • Design complexity
  • Time to Market is Critical
  • Fewer design resources required
  • Quality
  • Through Standardized flows and tools
  • Automated Design tools and flows have several limitations that affect low power
  • Many automated tools don’t consider power
  • Others don’t make the correct tradeoffs between power and area/timing.
  • This Presentation focuses on the tradeoffs involved with several low power techniques used on Serra and the limitations of automated design for low power

slide-16
SLIDE 16

Customized Design for Low Power

  • Custom design flows and circuits can produce better results

  • Lower Power, Higher Speeds, Less Area
  • Custom design requires more design effort and time
  • Use customized Design and signoff ONLY in critical areas
  • Pick areas of customization to get the greatest benefit
  • Clock Trees
  • Raw Clock Trees
  • Raw Clock Dividers
  • Memory IP
  • Standard Cell
  • Move the customization into IP
  • Use automation to insert the IP, check the IP and optimize with IP.


slide-17
SLIDE 17

Customization of Raw Clock Network

  • Raw clock networks are high speed, high power nets from PLLs to dividers
  • Raw clock dividers are stacked and custom routed.
  • Width and spacing are chosen for optimal clock isolation while maintaining fast transition times.
  • Use minimal clock buffers to distribute clocks within the network but maintain desired transition delay.
  • 10-input tri-state mux
  • Reduces insertion delay and power
  • Custom Layout of raw dividers
  • Reduces critical path delay, voltage noise and optimizes rise/fall times.
  • ~4x reduction in Raw clock Power (Compared to Previous chip)

Figure: Traditional Wide Mux Structure versus Raw Clock Network, each fed by PLLs. Selected Clock Path (green), Non-Selected Active Clocks (red).

slide-18
SLIDE 18

LOW POWER IP

slide-19
SLIDE 19

Low Power Memory

Process   Peripheral with footer   Bit-cell array with header   Sleep with data retention   Sleep without data retention
90nm      Yes                      No                           Yes                         No
65nm      Yes                      No                           Yes                         No
45nm      Yes                      Yes                          Yes                         Yes

Figure: Leakage at 90nm, 65nm and 45nm for the core array and periphery in Function, Sleep w/ retention and Sleep w/o retention modes, with and without Vdd scaling.

  • Bit-cell leakage is up 6X in 45nm.
  • No hVt devices.
  • All memories need to have leakage control
  • Circuit + System solutions
  • Peripheral footer
  • Bit cell header
  • Vdd scaling

Maintain only the useful data with array header and reduce Vdd during sleep mode to manage the leakage.

slide-20
SLIDE 20

Memory Partial Bank Collapse

  • Power Gating portions of the bit-cell array that are not needed
  • Standby/Active Leakage reduction
  • Some active power reduction since clock/data is gated to banks that are not accessed.
  • Requires Proper memory management in SW and FW.

slide-21
SLIDE 21

STANDARD CELL POWER REDUCTION

slide-22
SLIDE 22

Standard Cell Leakage

  • 45nm Standard Cell Challenges
  • Ioff increase and no HVT device compared to 65nm
  • Performance provided by NVT is not required everywhere
  • Power Gating not possible in all blocks
  • 45nm OPTIONS
  • Use Longer Channel length NVT device
  • Min channel length is 40nm in 45nm tech
  • Use Stacked NVT devices
  • Replace every device with a stack of 2 devices

Length (L)   Spacing (S)   Pitch   Increase
40n          40n           180n
50n          45n           200n    11%
60n          45n           210n    17%

Pitch = L/2 + S + 60n + S + L/2

QCT45

Leakage Scaling Factor, 65nm to 45nm (TT, 25C)

       H2N     H2NL   N2N    L2L     L2N
N      24.71   9.45   1.61   10.48   0.56
P      27.58   5.68   3.00   26.80   1.51
Ave    26.14   7.57   2.31   18.64   1.03
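
The pitch formula on this slide, Pitch = L/2 + S + 60n + S + L/2, reproduces the pitch and increase columns of the table. The sketch below is a direct transcription; the function names are hypothetical, and reading the 60n term as a fixed width is an assumption based on the drawing.

```python
FIXED_TERM_NM = 60.0  # the constant "60n" term in the slide's formula


def pitch_nm(length_nm: float, spacing_nm: float) -> float:
    """Pitch = L/2 + S + 60n + S + L/2, all values in nanometres."""
    return length_nm / 2 + spacing_nm + FIXED_TERM_NM + spacing_nm + length_nm / 2


def pitch_increase_pct(length_nm: float, spacing_nm: float,
                       base_length_nm: float = 40.0,
                       base_spacing_nm: float = 40.0) -> float:
    """Percent pitch increase relative to the minimum-length (40n/40n) case."""
    base = pitch_nm(base_length_nm, base_spacing_nm)
    return 100.0 * (pitch_nm(length_nm, spacing_nm) - base) / base
```

Evaluating the three table rows gives 180n, 200n and 210n, with increases that round to 11% and 17%, matching the slide.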

slide-23
SLIDE 23

Simulation Results

Figures: four bar charts comparing the candidates NVT 40n, NVT 50n, NVT 60n, NVT 70n and Stacked NVT (TSMC TT, 1.1V, 25C):

  • Leakage Savings (factor, 0 to 9)
  • Delay Increase (factor, 0 to 3.5)
  • Area Increase (%, 0% to 10%; Area: Comb.)
  • Switching Cap Increase (%, 0% to 70%)

One entry is marked Not Available.

slide-24
SLIDE 24

Standard Cell Leakage

  • Longer Channel length
  • Leakage savings = 4.5X (L=50nm), 6.6X (L=60nm)
  • Delay = 1.18X (L=50nm), 1.32X (L=60nm)
  • Area increase ~ 5.8% (L=50nm), 6.7% (L=60nm)
  • Switching Cap increase ~ 16% (L=50nm), 32% (L=60nm)
  • Stacking is NOT as beneficial
  • 8.3X Leakage Reduction
  • Steep Delay Penalty 2.9X
  • Substantial area penalty
  • Switching cap increase ~ 62%
  • Conclusion
  • L=40 to L=70 was evaluated
  • Diminishing returns in leakage savings as L increases
  • Dynamic power versus Leakage tradeoff
  • Provide NVT and LVT with two channel lengths in 45nm Library
  • Nvt and PNvt (long channel)
  • Lvt and PLvt (long channel)
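
One way to see why stacking is "NOT as beneficial" is to divide each option's leakage savings by its delay penalty. This figure of merit is an illustrative assumption, not a metric from the presentation; the numbers are the ones quoted on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    leakage_savings_x: float        # leakage reduction factor from the slide
    delay_x: float                  # delay factor from the slide
    switching_cap_increase: float   # fractional switching cap increase


CANDIDATES = [
    Candidate("NVT L=50nm", 4.5, 1.18, 0.16),
    Candidate("NVT L=60nm", 6.6, 1.32, 0.32),
    Candidate("Stacked NVT", 8.3, 2.9, 0.62),
]


def savings_per_delay(c: Candidate) -> float:
    """Hypothetical figure of merit: leakage savings per unit of delay factor."""
    return c.leakage_savings_x / c.delay_x


def rank_candidates(candidates):
    """Best-first ordering by the figure of merit above."""
    return sorted(candidates, key=savings_per_delay, reverse=True)
```

Under this metric the two long-channel options score roughly 5.0 (L=60nm) and 3.8 (L=50nm), while stacking scores about 2.9 despite having the largest raw leakage reduction. Switching cap is carried in the data but deliberately left out of the metric; a real library decision would weigh it and area as well.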


slide-25
SLIDE 25

Block Level Leakage Comparison

  • 65nm vs. 45nm (Serra)
  • Compare the Leakage of 6 Hard Macros with the fewest RTL changes.
  • 65nm vs. 45nm Library
  • 1.23v vs. 1.125v
  • Higher Speed blocks (B, E, F)
  • Benefited from increased speeds in 45nm
  • Standard Cell Leakage actually decreased
  • Medium/Slow speed blocks (A, C, D)
  • Increase in standard cell leakage from 30-66%.
  • At the block level the ave. leakage increase from 65nm -> 45nm is 1.5x to 2x

Figure: Block leakage (0 to 50 uW) for BLK A to BLK F, 65nm vs. 45nm.

Block   65nm       45nm       Diff
BLK A   19.10 uW   24.70 uW   29.3%
BLK B   0.39 uW    0.18 uW    -54.4%
BLK C   15.50 uW   20.20 uW   30.3%
BLK D   25.50 uW   42.40 uW   66.3%
BLK E   0.93 uW    0.14 uW    -84.6%
BLK F   46.30 uW   32.00 uW   -30.9%
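
The Diff column above is the relative change from the 65nm to the 45nm leakage. A minimal sketch of that arithmetic follows, using the rounded uW values printed on the slide. For the two sub-microwatt blocks (B and E) the recomputed percentages differ by a few tenths of a point from the printed ones, presumably because the slide's Diff values were computed from unrounded measurements.

```python
# (65nm leakage, 45nm leakage) in uW, as printed on the slide
BLOCK_LEAKAGE_UW = {
    "BLK A": (19.10, 24.70),
    "BLK B": (0.39, 0.18),
    "BLK C": (15.50, 20.20),
    "BLK D": (25.50, 42.40),
    "BLK E": (0.93, 0.14),
    "BLK F": (46.30, 32.00),
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative = decrease)."""
    return 100.0 * (new - old) / old


def leakage_diffs(table=BLOCK_LEAKAGE_UW):
    """Per-block leakage change, 65nm -> 45nm, in percent."""
    return {blk: percent_change(n65, n45) for blk, (n65, n45) in table.items()}
```

Blocks with a negative result (B, E, F) are the higher speed blocks whose standard cell leakage decreased; A, C and D show the 30-66% increase quoted in the bullets.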
slide-26
SLIDE 26

MTCMOS

slide-27
SLIDE 27

MTCMOS - Strategy

  • Mixed Vt Libraries are used to optimize both leakage and active power
  • Higher Vt reduces leakage
  • Use Long-channel devices to provide additional leakage reduction
  • Small area and active power penalty
  • Lower Vt can improve active power
  • providing better Isat / Cg ratio
  • Lower Vt has better relative performance and less delay variation at Low Vdd.

  • Applications of Low Vt
  • High Activity Nets
  • Exclusively for clock trees.
  • Less insertion delay, skew, variation, optimal repeater.
  • Decreased active power at the cost of leakage.
  • In blocks that are power gated
  • In power domains with aggressive voltage scaling
  • In high performance blocks.


slide-28
SLIDE 28

MTCMOS – EDA Limitations

  • Most EDA tools do a poor job of active power optimization with mixed Vt Libraries
  • EDA tools DO use Mixed Vt for timing.
  • Don’t know where to use Low Vt for active power reduction.
  • Need usage information to tradeoff leakage and active power
  • High activity nets – Requires Vector Information
  • Clock Nets – Requires Constraints to restrict cell list
  • Blocks that are power gated – Power Intent
  • Blocks that are Voltage scaled – Power Intent
  • Solution
  • Users decide which Vt is most appropriate.
  • Users constrain or restrict the tool’s choice of Vt.
  • Use single Vt for initial runs and multi-Vt for incremental optimization.

slide-29
SLIDE 29

MTCMOS – EDA Limitations

  • Many opportunities for leakage recovery
  • Timing slack on paths
  • Excess margin early in the flow
  • Pessimistic view of timing
  • Optimization cost functions favor area over leakage
  • Point tools exist in the industry that were created for this purpose.
  • Special scripts in Timing Signoff tool.
  • Leakage Recovery within PTSI™
  • Knowing where to use High Vt for leakage power reduction

Zhan, “A Utility for Leakage Power Recovery within PrimeTime SI” SNUG Boston 2008
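
The idea behind leakage recovery, spending leftover timing slack on lower-leakage cells, can be sketched as a greedy loop. This is a toy model under stated assumptions and is not the algorithm of the PrimeTime SI utility cited above: the `Cell` type, the per-path slack bookkeeping and all numbers are hypothetical, and a real flow would re-time the design with the signoff tool after each batch of swaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str
    delay_penalty_ps: float   # extra delay if swapped to the lower-leakage variant
    leakage_saving_nw: float  # leakage saved by the swap


def recover_leakage(cells, path_slack_ps, path_cells):
    """Greedily swap cells, largest leakage saving first.

    path_slack_ps: {path name: positive setup slack in ps}
    path_cells:    {path name: list of cell names on that path}
    A cell is swapped only if every path through it keeps non-negative slack.
    Returns (names of swapped cells, remaining slack per path).
    """
    slack = dict(path_slack_ps)
    swapped = []
    for cell in sorted(cells, key=lambda c: c.leakage_saving_nw, reverse=True):
        affected = [p for p, names in path_cells.items() if cell.name in names]
        if all(slack[p] >= cell.delay_penalty_ps for p in affected):
            for p in affected:
                slack[p] -= cell.delay_penalty_ps
            swapped.append(cell.name)
    return swapped, slack
```

With hypothetical cells u1 (30 ps, 50 nW), u2 (20 ps, 40 nW), u3 (25 ps, 10 nW), path p1 = [u1, u2] with 40 ps slack and p2 = [u2, u3] with 100 ps slack, the loop swaps u1 and u3 and rejects u2 because p1 has only 10 ps left. The example also shows why pessimistic timing or fixed margins reduce recovery: less reported slack means fewer accepted swaps.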

slide-30
SLIDE 30
  • Misc. Standard Cell Power
  • Low Clock Power Flip Flops
  • Additional Flip Flop Family that targets power reduction
  • New Clock Topology
  • Enables designers and tools to make a power vs. speed tradeoff
  • 21% lower clock power, 6% smaller, but 36% slower than regular flop.
  • Originally they had the same footprint
  • EDA tools did not insert the Low Clock Power Flops unless they were also smaller. [Cost functions favor Area & Timing]

slide-31
SLIDE 31

CLOCK TREE POWER

slide-32
SLIDE 32

Clock Tree Power

  • Clock Tree Power is still a major contributor to total active power (30-40%)
  • Clock Architecture and Frequency Planning
  • Clock Architecture has a huge impact on power

slide-33
SLIDE 33

Clock Tree Architecture

  • Number of PLLs
  • Independent PLLs
  • Increase flexibility and ease of use.
  • Optimizes Power for simple use cases
  • Shared high frequency PLLs
  • Lower system power in concurrency modes
  • However, SW may not be able to manage very complex frequency plans which results in power inefficiencies
  • Clock domain partitioning
  • Independent Clock Domain Control (HW/SW)
  • Frequency and Gating
  • Clock Synchronicity
  • Asynchronous domains are more flexible
  • Increased latency across clock domain boundaries
  • Synchronous vs. Iso-Synchronous
  • Saves power in low performance modes. Cost power in worst case

slide-34
SLIDE 34

Clock Gating

  • Still one of the most effective ways to save active power
  • Maximize fine-grain clock gating
  • Optimize settings for integrated clock gating (w/ PowerCompiler™)

– Almost always a win.
– Beware of increased skew, insertion delay, timing and area impact.

  • Manually inserted Architectural CGCs
  • Gate modules based on Modal and Temporal need.
  • Can be HW or SW controlled
  • Analyze the Clock Gating Percentage (CGP) and Clock Gating Efficiency (CGE) to find new opportunities

slide-35
SLIDE 35

Example - Clock Power

  • Example of a Complex DMA engine with multiple bus masters.
  • Separated Client Interface clocks
  • Provided Independent clock frequency control
  • Div 1-4
  • Iso-sync / Async support on bridges
  • Enables Master/Slave clock division

Figure: Clock tree partitioning. Previous: one TREE of ~1650 buffers driving ~18655 registers. Partitioned: CXC cells split the tree into LEAF groups of ~2947, ~3009, ~2197, ~1540 and ~8962 registers, with buffer counts of ~1650/2, ~1650/8, ~1650/8, ~1650/8 and ~1650/8.

Bus Master Clock Power Comparison (mA)

  • Baseline: 90nm 17, 65nm r1 8.0578, 65nm r2 5.7606, Serra 3.377
  • CI0 0.2507, CI1 0.2307, CI2 0.2232, CI3 0.1979

slide-36
SLIDE 36

Low Power CTS

  • Clock Tree Synthesis (CTS)
  • Used for the majority of ~250 clock trees on Serra
  • Small changes to the constraints and cells can dramatically affect the power of CTS trees.
  • Custom Clock Cell Design
  • Clock Buffers, Inverters, Muxes, Gating Cells, Dividers.
  • Internal clock routes are minimized to reduce clock power
  • Insertion delays are minimized to reduce overall clock power and skew
  • Low power CTS
  • Higher granularity of clock buffers.
  • Relaxed transition and skew constraints
  • Modal skew balancing (functional vs. test)

slide-37
SLIDE 37

Low Power CTS – EDA Deficiencies

  • Most CTS tools only consider Global skew and transition time.
  • CTS – Example
  • 1) Balance Global Skew
  • 2) 100ps Max Transition
  • 3) Min Area/Power
  • A. Default Settings
  • B. Optimized Run
  • Need tools which
  • Consider Local vs. Global skew.
  • Relax clock transition times
  • Cbuf sizing on paths
  • Cluster registers that share a clock or clock gate
  • Insert, clone and de-clone clock gates
  • Move clock gates up the tree

Figure: CTS results for A (Default Settings) and B (Optimized Run).

slide-38
SLIDE 38

MULTIPLE POWER DOMAINS

slide-39
SLIDE 39

Multiple Power Domains

  • Power Domain Partitioning
  • More power regimes from PMIC
  • + Better Power Control and Efficiency
  • + Independent Voltage Scaling and Power Collapsing
  • - Increased PDN impedance and IR Drop
  • - Increased Bill Of Materials (BOM)
  • - Requires level shifters and resynchronization at boundaries.
  • Need a small on-chip regulator with fast response and good efficiency
  • IR Drop and PDN impedance
  • - Increased Power Density
  • - Increased metal resistance
  • - Increased IR due to Power Switches
  • - IR impact is greater at lower voltages
  • - Dynamic IR effects on skew and timing not well modeled.

slide-40
SLIDE 40

PDN Impedance

  • Worst Case IR drop may not be at max frequency and power
  • An arbitrary PDN network observed from the die looking back towards the VRM is shown below. The PMIC has passive modeling.
  • Resonance points in the RLC network.

Figure: PDN impedance (0.01 to 1) versus frequency (1×10^3 to 1×10^9), log-log, comparing Z_spec with Z_network. Annotations: Potential Peaking due to PMIC & Bulk Cap; Board “Mounting Inductance”; Mid-band region controlled by decap inventory.

slide-41
SLIDE 41

VOLTAGE SCALING

slide-42
SLIDE 42

Static Voltage Scaling (ARM11)

  • SW selects a performance level based on application and/or SW performance monitors.
  • HW selects 1 of 8 pre-programmed Frequency-Voltage pairs
  • Benefits
  • Active power reduction
  • 32%-36% based on lab measurements at 96 MHz
  • Limited by Memory Vcc Min
  • Easier to implement on Applications processor
  • Drawbacks
  • Requires separate voltage regulator (vdd_apc)
  • Less efficient than AVS
  • SW Controlled vs. HW controlled AVS
  • SW Performance monitors and algorithms lagging
  • Requires careful characterization of ARM11ss FMAX vs. Voltage
  • Timing Closure (Corners, Variation, Margins)

Figure: Static voltage scaling block diagram. Labels: Serra Digital Die with ARM11SS, Modem ARM Control, ARM9 Peripheral Bridge and MPU; level registers lvl 1 to lvl 8 (addr/data); Next PLevel, Current PLevel and stat; FSM with req/ack; svs_cntl; SSBI link to PMIC7500; supplies vdd_apc and vdd_dig.

slide-43
SLIDE 43

Static Voltage Scaling (MSM Top)

  • Multiple Blocks in the same power regime
  • The highest required voltage is determined from multiple LUTs.
  • LUT data obtained from FMAX characterization.
  • Software programs the voltage upon entry / exit of each mode
  • Drawbacks
  • SW Complexity
  • Fixed frequency blocks don’t scale well
  • Lots of characterization work
  • Risk of Test escapes

Not all blocks included in characterization.

Frequency-voltage LUTs shown in the figure (three tables):

Freq. (MHz)   Vdd (V)      Freq. (MHz)   Vdd (V)      Freq. (MHz)   Vdd (V)
128           1.15         256           1.20         256           1.20
96            1.10         196           1.15         196           1.15
64            1.05         128           1.10         128           1.10
48            1.00         96            1.05         96            1.05
32            0.975        64            1.00         64            1.00
25            0.95         32            0.975        32            0.95
20            0.95         20            0.950        20            0.95

Figure labels: BLK A, BLK B, BLK C; Mode1 Requires 1.2v, Mode2 Requires 1.15v, Mode3 Requires .95v
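
The rule "the highest required voltage is determined from multiple LUTs" can be sketched as a lookup plus a max. The two LUTs below copy frequency-voltage pairs from the slide's tables, but the names `LUT_1` and `LUT_2`, the pairing of a LUT with a particular block, and the choice to round a requested frequency up to the next characterized entry are all assumptions made for illustration.

```python
import bisect

# (frequency in MHz, required Vdd in V), sorted by frequency
LUT_1 = [(20, 0.95), (25, 0.95), (32, 0.975), (48, 1.00),
         (64, 1.05), (96, 1.10), (128, 1.15)]
LUT_2 = [(20, 0.950), (32, 0.975), (64, 1.00), (96, 1.05),
         (128, 1.10), (196, 1.15), (256, 1.20)]


def required_vdd(lut, freq_mhz: float) -> float:
    """Vdd of the lowest characterized entry at or above freq_mhz."""
    freqs = [f for f, _ in lut]
    i = bisect.bisect_left(freqs, freq_mhz)
    if i == len(lut):
        raise ValueError(f"{freq_mhz} MHz exceeds the characterized range")
    return lut[i][1]


def regime_vdd(requests) -> float:
    """Voltage for a shared power regime: the max over all blocks' needs.

    requests: iterable of (lut, freq_mhz), one entry per block in the regime.
    """
    return max(required_vdd(lut, f) for lut, f in requests)
```

For example, one block at 64 MHz on `LUT_1` (1.05 V) sharing a regime with another at 196 MHz on `LUT_2` (1.15 V) forces the whole regime to 1.15 V. This is also why fixed-frequency blocks "don't scale well": a single block pinned at a high frequency sets the floor for everything in its regime.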

slide-44
SLIDE 44

Process Monitoring DVS

  • Process Monitoring DVS
  • Increased process variation at 45nm, increased benefits of Process related Dynamic Voltage Scaling.
  • Measure PM speed.
  • SW obtains the required voltage from LUT.
  • LUT is created through Characterization.
  • Benefits
  • 16% power reduction for TTT, 31% for FFF estimated.
  • More practical than DVFS on top-level
  • AVS is too complex to implement on top-level
  • Drawbacks
  • Impact on ATE and test time, binning and test escapes.
  • Characterization effort
  • Timing Closure (Corners, Margins, Variation)

Figure: Simulation Data: Fmax vs Vdd for FFF, TTT, SSS parts (Fast, Typical and Slow curves against the Design Target frequency, with pass and fail regions).

Process Bin   Required Vdd (V)
Fast          0.950
Typ.          1.025
Slow          1.125
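
A rough cross-check of the quoted savings: if only dynamic power is considered and it scales with V², the process-bin voltages on this slide give savings in the same range as the slide's estimates. This is a simplified sketch, not the method behind the 16% / 31% figures, which presumably also account for leakage and other effects; the function name and the choice of the slow-bin voltage as the reference are assumptions.

```python
# Required Vdd per process bin, from the slide
BIN_VDD = {"fast": 0.950, "typ": 1.025, "slow": 1.125}


def dynamic_power_saving(bin_name: str, reference_bin: str = "slow") -> float:
    """Fractional dynamic power saving from running a part at its bin voltage
    instead of the reference (worst-case) voltage, assuming P_dyn ~ V^2."""
    ratio = BIN_VDD[bin_name] / BIN_VDD[reference_bin]
    return 1.0 - ratio ** 2
```

The V²-only estimate gives about 17% for typical parts and about 29% for fast parts, against the slide's 16% and 31%.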

slide-45
SLIDE 45

Multi-Mode Multi-Corner

  • 45nm designs require MMMC signoff for Hold robustness
  • Hold closure at several PVT corners.
  • Functional Mode and Test Mode.
  • MMMC Implementation is also necessary to achieve lowest power.
  • Voltage Scaling:
  • Optimize Implementation across several PVT corners to make the correct decisions
  • MMMC Reduces iterations and effort for closing Setup & Hold.
  • Power Optimization
  • Power corners are different than timing corners.

– Leakage information is very inaccurate at Temperature/Voltage extremes.
– Optimize power for the nominal use modes.

slide-46
SLIDE 46

Multi-Mode Multi-Corner

  • MMMC Requirements
  • Constraints for each Mode are needed early.
  • More library characterization required.
  • >15 Corners used on Serra
  • MMMC Limitations
  • Increased tool run times (Physical Design cycle time)
  • Time to market
  • No mature solution for MMMC throughout the flow
  • Used single corner Synthesis and Placement with some iteration and over-margining on Serra.
  • MMMC was used during CTS, Post-CTS opt and final timing closure.
  • Evaluation of Commercial tools

– Evaluated industry leading tools for MMMC synthesis/placement after Serra. (Still Maturing)

slide-47
SLIDE 47

Variation and Margins

  • Variation is increasing at smaller geometries and lower voltages
  • Adding fixed margins is detrimental to power
  • Increased hold buffer insertion
  • Larger gate sizes
  • Swapping to lower Vt when not needed.
  • Use OCV to apply margin only where necessary.


Monte Carlo Simulations of Hold Time at 2 VT Corners

slide-48
SLIDE 48

VDD Minimization During Sleep

Figure: Estimated Leakage Power Reduction by VDD Minimization. Scaled Leakage Power (%, 0% to 100%) versus Supply Voltage (V, 0.2 to 1.2). Simulated [PNVt TT, 25C].

  • Minimize voltage of entire digital die during sleep
  • Especially beneficial during short sleep cycles (BT Scan, QChat)
  • Estimated 70% leakage power reduction @ Battery by lowering the digital voltage from 1.1V to 0.7V
  • Minimal SW overhead and fast restoration.
  • All memory and register state is retained.

slide-49
SLIDE 49

VDD Minimization (Serra)

  • Serra Leakage current during VDD MINIMIZATION
  • Calculated 64% savings at Battery
  • 80% efficiency, 3.7V Battery.
  • Voltage minimization is the most effective way to control digital leakage without complicated SW changes.

$I_{batt} = \dfrac{I_{msm} \cdot V_{msm}}{0.8 \cdot V_{batt}}$

Serra (Normalized) Off Leakage, Measured (3 TT parts, 25C). Ave @ MSM is measured; Ave @ Battery is calculated.

Voltage   Ave @ MSM   Ave @ Battery
1.078     1.00        0.36
0.81      0.47        0.13
Savings   52.66%      64.43%
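
The battery-side numbers follow from a power balance across the regulator: the current drawn from the battery equals the MSM current times the MSM voltage, divided by the regulator efficiency times the battery voltage. The sketch below applies that relation with the slide's 80% efficiency and 3.7 V battery to the normalized MSM currents in the table; the function names are hypothetical.

```python
def battery_current(i_msm: float, v_msm: float,
                    v_batt: float = 3.7, efficiency: float = 0.8) -> float:
    """Current drawn from the battery for a given MSM current and voltage.

    Power balance: I_batt * V_batt * efficiency = I_msm * V_msm.
    Units of the result follow the units of i_msm (normalized here).
    """
    return i_msm * v_msm / (efficiency * v_batt)


def battery_savings(i_hi: float, v_hi: float, i_lo: float, v_lo: float) -> float:
    """Fractional reduction in battery current when moving from (i_hi, v_hi)
    to (i_lo, v_lo) on the MSM side."""
    before = battery_current(i_hi, v_hi)
    after = battery_current(i_lo, v_lo)
    return 1.0 - after / before
```

Plugging in the table rows gives about 0.36 at 1.078 V and about 0.13 at 0.81 V, and a battery-side saving close to 65%, in line with the 64.43% on the slide (the small gap comes from the rounded 0.47 input). The battery-side saving exceeds the MSM-side saving because both the current and the voltage it is multiplied by drop.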

slide-50
SLIDE 50

Full Chip Power Collapse

  • Full Chip collapse at PMIC regulator
  • Requires data to be saved and re-boot of all processors
  • Power Collapse Break-Even Point

Energy Saved (Leakage * Time) > Energy Overhead (Save/Restore)

  • For 2.56 Sec Slot cycle (Assuming 5mJ of Overhead) Power Collapse is beneficial if leakage is greater than 1.8mA

Figure: Which consumes less energy? w/ or w/o PC. Leakage current (mA, 1 to 5) versus Sleep Time (Sec, 1 to 5), with break-even curves for PC/OH of 0.5mJ, 1mJ, 2mJ, 5mJ and 10mJ and a marker at the PCH Time Slot (2.56 Sec). Additional labels: Scaled/Estimated Serra Current (mA); 0.68, 34, 76; 2,477, 2,508, 2,560.
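
The break-even condition on this slide can be turned into a one-line calculation: power collapse pays off when leakage current times supply voltage times sleep time exceeds the save/restore energy. The slide does not state the voltage used for its 1.8 mA figure; the 1.1 V default below is an assumption (it is the digital voltage quoted on the VDD minimization slide) that happens to reproduce the number. Function names are hypothetical.

```python
def break_even_leakage_ma(overhead_mj: float, sleep_s: float,
                          vdd_v: float = 1.1) -> float:
    """Leakage current (mA) above which power collapse saves energy.

    Energy saved = I * V * t must exceed the overhead, so
    I_break_even = overhead / (V * t); mJ / (V * s) = mA.
    """
    return overhead_mj / (vdd_v * sleep_s)


def collapse_is_beneficial(leakage_ma: float, overhead_mj: float,
                           sleep_s: float, vdd_v: float = 1.1) -> bool:
    """True if collapsing for sleep_s seconds saves more than the overhead."""
    return leakage_ma > break_even_leakage_ma(overhead_mj, sleep_s, vdd_v)
```

For the 2.56 s slot cycle with 5 mJ of overhead this gives roughly 1.8 mA, matching the bullet above. Shorter sleep intervals or larger overheads raise the threshold, which is why short sleep cycles favor VDD minimization over full collapse.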

slide-51
SLIDE 51

POWER GATING

slide-52
SLIDE 52

FS/HS Power Gating

  • Headswitch (HS) and footswitch (FS)
  • Serially adding a high resistance device between power rails in low leakage mode
  • Power Switch design is a trade-off between leakage power, IR drop (when FS/HS are on), area, ramp-up noise and time.
  • Local FS is integrated within logic circuit
  • Used at Qualcomm since .13u
  • Minimum impact on design automation flow
  • Finer control to turn down logic circuits independently
  • State Maintained
  • (Leakage savings / Cost) is low.
  • Global FS/HS are integrated within power grid
  • Switches shared by all logic in a domain, thus smaller size and lower IR drop is possible.
  • Can be implemented with finer grid, e.g. global distributed FS/HS (GDFS), or coarse grid, e.g. FS/HS power ring
  • GDFS is used in 90nm and 65nm designs
  • Employ a save & restore scheme

Figure: Local Footswitch (Vdd, Vss) and Global FS (Vdd, Vss_int, Vss) circuit diagrams.

slide-53
SLIDE 53

Global FS/HS implementations

  • FS/HS ring
  • Less PD effort
  • Shorter sleep control distribution
  • Larger IR drop compared with GDFS, especially in flipchip case
  • IR drop increases quicker when the size of the block increases (cubic w.r.t. the length)
  • Only Suitable for small size macros
  • Global distributed FS/HS
  • Can be modeled as an additional resistance between global and local power mesh
  • Does not break global mesh
  • Needs sleep control signal distribution throughout.
  • Suitable for large size macros

Figure: FS Ring and GDFS layouts showing the Global PG Mesh and Local PG Mesh, and a footswitch cell schematic with devices Mf and Mr between vssfx and Vss and enable pins En_few_in / En_few_out and En_rest_in / En_rest_out.

slide-54
SLIDE 54

Serra GDFS Design

  • Global Distributed Footswitch (GDFS) was chosen for leakage reduction vs. area cost
  • Sleep leakage control
  • Extra long channel NVT transistors used for footswitch cells
  • Active leakage control
  • Turn off macros that are not used in certain operating modes
  • High Temperature active leakage control

slide-55
SLIDE 55

GDFS Design Flow

  • Use QCOM in-house tools for inserting isolation cells.
  • Use existing commercial tools to insert switches and connect enables.
  • Use existing commercial tools + scripts to insert isolation cells during Physical Design.
  • Pre-place GDFS cells and do IR drop analysis at early stage
  • Further refine the size of GDFS cells according to local IR drop

slide-56
SLIDE 56

Low Power Verification

  • Power Aware Verification is Required
  • Verify entry to & exit from low power states
  • Properly model power collapse
  • Verify clamp polarity
  • Power Aware Simulation tools
  • Qualcomm Scripts
  • Commercial Tools
  • Power Structural Checks
  • Verify Power domain crossings

– Isolation cells

  • Commercial Tools @ 3 design stages

– RTL
– Logical Gate
– Physical Gate

slide-57
SLIDE 57

Serra Leakage

Figure: Total Leakage Current by State (Active, Idle, Dormant, Power Off); legend: gfs, mem, std, N2, T2, T1 (TTT, 1.125V, 25C). Columns: Estimated; Dots: Measured.

  • Leakage by State
  • Measurements on 3 TT Parts
  • 4x reduction in leakage from Active to Power off states. (Excluding Vdd min)
  • >4x leakage reduction from GDFS.
  • Power Off Variation
  • Mean-to-mean variation across 3 Typical Wafers is 20%
  • Variation across dice from same wafer is ~100%

slide-58
SLIDE 58

CONCLUSIONS & FUTURE DIRECTIONS


slide-59
SLIDE 59

Conclusions

  • Power IS a key differentiator and limits the design
  • Tools and users must Design for Power
  • Power is a different beast than area/timing
  • Power/Energy is highly dependent on use case (Vector)
  • Must consider Static and Dynamic Power
  • Power varies across PVT.

– Optimization at Nominal
– Constrained by worst case.

  • Designing for Low Power is all about Tradeoffs.
  • Many Low Power Techniques exist
  • Each needs to be applied with tradeoffs in mind.
  • Know the tradeoffs and what your specific goals are.
  • We still need more techniques to meet the customer demands. (We are not doing enough)

slide-60
SLIDE 60

Conclusions

  • Design automation is required for Complex SOCs, but Customization in select areas can Dramatically reduce power.
  • Move Customization into IP
  • Several deficiencies and limitations exist within EDA tools.
  • Multi-Mode Multi-Corner Design is a must
  • EDA tools need to improve and move MMMC up in the flow
  • Timing and power corners are not the same.
  • Simultaneous optimization of timing, power and area needed at all stages
  • High Level Synthesis
  • RTL Design & Optimization
  • Logic Synthesis
  • Clock Tree Synthesis
  • Place and Route
  • Timing Closure

slide-61
SLIDE 61

Conclusions

  • Getting the Architecture right is critical to power
  • High Level System Power Modeling required for making correct architectural design decisions
  • Need to have “relatively” accurate power models to make tradeoffs
  • Power Vectors
  • Accurate power estimation and optimization require accurate activity information
  • There is a need for tools/methods which extract activity information from real SW code running on Emulation and/or system models.
  • Benchmarks and verification type vectors are often quite different from “real” world use cases
  • Vector-less power estimation and optimization is a very challenging problem.
  • Some have attempted, but no satisfactory solution seen thus far.

slide-62
SLIDE 62

References

[1] http://www.nttdocomo.co.jp/english/info/notice/page/090227_00.html
[2] Bruce Zhan, “A Utility for Leakage Power Recovery within PrimeTime SI”, SNUG Boston 2008
[3] Krzysztof A. Kozminski, “Optimization for Leakage Power with PrimeTime”, SNUG San Jose 2004