Low Power SOC Design and Automation
Matt Severson
Qualcomm CDMA Technologies
July 27, 2009
- Introduction
- Overview of Serra (Qualcomm’s first 45nm tapeout)
- Features / Technology / Low Power Techniques
- Tradeoffs of Automated vs. Custom Design for Low Power
- Memory IP
- Standard Cell
- Mixed-Vt Design
- Clock Power
- Clock Gating
- Clock Tree Synthesis
- Multiple Power Domains
- Voltage Scaling
- Voltage Islands with Power Gating
- Conclusions & Future Directions
Outline
2 July 2009
Introduction
- Power consumption is a key differentiator in wireless communications products.
- Power Consumption Limits
- Battery Life
- Performance
- Feature set
- Form Factor
Introduction – Form Factor
Phone Surface Temperature Rise Above Ambient
[Figure: temperature rise above ambient (C, 1 to 100) vs. surface power density (W/sq-in, 0.01 to 1), log-log axes]
- Surface power densities less than 0.1 W/sq-in: this is the recommended design area
- Surface power densities between 0.1 and 0.22 W/sq-in: phone is likely to have local hot spots
- Surface power densities greater than 0.22 W/sq-in: phone should be redesigned
- Power Densities Increasing
- Overheating
- Limit Form Factors
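The surface power density zones on this slide can be written as a small check. This is a minimal sketch: the function name, the zone labels, and the treatment of the exact boundary values (0.1 and 0.22 W/sq-in) are assumptions, not something the slide specifies.

```python
def classify_surface_power_density(power_w: float, area_sq_in: float) -> str:
    """Map a surface power density (W/sq-in) to the design zones on the chart."""
    if area_sq_in <= 0:
        raise ValueError("surface area must be positive")
    density = power_w / area_sq_in
    if density < 0.1:
        return "recommended design area"
    if density <= 0.22:
        return "likely local hot spots"
    return "redesign"


# Hypothetical phone with 25 sq-in of surface dissipating 2 W, 4 W and 6 W.
zones = [classify_surface_power_density(p, 25.0) for p in (2.0, 4.0, 6.0)]
```

With a fixed surface area, the only way to stay in the recommended zone as features are added is to lower the dissipated power, which is the form-factor argument the slide makes.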
Introduction - Battery Life
Battery Life Analysis
[Figure: stacked bars of review ratings, 1 Star (Bad) to 5 Stars (Good), as % of total reviews for Device 1 to Device 10 (Verizon™ and AT&T™ phones), annotated with expressed dissatisfaction, phone chipset, battery capacity and screen pixels]
Data labels, in chart order: 44%, 43%, 21%, 28%, 14%, 26%, 35%, 12%, 19%, 48%
Battery capacities shown: 860, 800, 800, 1400, 930, 1300, 1130, 1500, 910, 880 mAh
Screen pixels shown: 240 x 320, 240 x 320, 240 x 320, 360 x 480, 240 x 320, 240 x 400, 240 x 320, 240 x 320, 240 x 320, 176 x 220
SERRA
Qualcomm’s First 45nm Tapeout
Serra Feature Set
- Modem
- CDMA 1xEV-DO revA & B
- UMTS (includes HSDPA, HSUPA)
- GSM (includes GPRS and EDGE)
- Unified GPS engine for both CDMA and UMTS modes.
- Processors
- QDSP4u8 based MDSP core
- ARM11 core with 32KB I/D cache
- ARM926 core with 32KB I/D cache
- QDSP5u4 based ADSP core w/ 256 KB L2 cache
- Multimedia
- 24 bit WVGA w/ LCDC (active refresh)
- ATI LT graphics core (Open GL 2.0)
- 22M Triangles/Sec
- 8 Mpixel Camera support
- Peripherals
- 2 HS USB interfaces
- MDDI gen 1.5
Serra Physical Characteristics
- Die size: 8200.08 x 6500.34 um (53.3 mm^2)
- Signal I/Os: 419
- Process: tsmc45lp
- Metal Layers: 6 (5 thin, 1 thick (4x)) and 1 AP RDL layer
- Total # Transistors: 170 Million
- Total # RAM bits: 13.7 Mbits
- Total # ROM bits: 1.1 Mbits
- Static IR Drop: < 10mV (@ Worst case 800 mA)
- Leakage: ~450 uA (TT, 25C, 1.125V)
- Package: 671 pin 13x13 NSP, 0.5mm ball pitch
- Includes Serra (Digital die) + Analog + Memory
Serra Low Power Design Goals
Background
- Leakage Power is increasing due to process
- 45nm Sub-Threshold is worse than 65nm (pA/um)
- Gate leakage is increasing.
- Junction Diode leakage is increasing.
- 45nm Process has no HVt transistor.
- Simple scaling of Dynamic Power is not Enough.
- + Dynamic Power will scale down with process geometries (-C)(-V)
- However increased wire cap will temper the reduction
- Increased performance demands and more applications (+f) (+C)
- Aggressive Product requirements for battery life
- Conclusion:
- More aggressive leakage and active power management techniques are required in 45nm
- Low Power Priorities / Goals for Serra:
- 1 Decrease Dynamic Power
- 2 Maintain the total static leakage power.
- 3 Keep Active Leakage a “small” percentage of Dynamic Power (< ~15%)
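The (-C)(-V) and (+f)(+C) annotations above refer to the terms of the usual switching-power estimate. The sketch below uses the textbook relation P = activity x C x Vdd^2 x f; all numeric values are hypothetical and only illustrate how a frequency increase can cancel the gain from lower capacitance and voltage. The 15% check mirrors goal 3, with the function names being assumptions.

```python
def dynamic_power(activity: float, cap_f: float, vdd: float, freq_hz: float) -> float:
    """Textbook switching-power estimate in watts: activity * C * Vdd^2 * f."""
    return activity * cap_f * vdd ** 2 * freq_hz


def active_leakage_ok(leakage_w: float, dynamic_w: float, limit: float = 0.15) -> bool:
    """Goal 3: active leakage stays below roughly 15% of dynamic power."""
    return leakage_w < limit * dynamic_w


# Hypothetical shrink: 20% less capacitance, 1.2 V -> 1.1 V, but 1.5x the clock rate.
old = dynamic_power(0.1, 1.0e-9, 1.2, 200e6)
new = dynamic_power(0.1, 0.8e-9, 1.1, 300e6)
ratio = new / old  # slightly above 1.0: the process gain is used up by frequency
```

In this made-up case the new design burns marginally more dynamic power than the old one, which is the situation the "simple scaling is not enough" bullet describes.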
Serra Low Power Features
- Low Power Multi-Threshold Qualcomm Standard Cell Library
- 2 Vt and 2 Channel Lengths
- Low Power Memory
- Power Collapsing of RAM/ROM periphery and core
- Independent Bank Collapsing for Large High Density Memories
- Advanced Low Power Clocking
- ~105 Master Clock Domains (I/O or Independent Frequency)
- ~230 Total Clock Domains (Synchronous, Iso-Synchronous, Asynchronous)
- Automatically inserted Fine grained clock gating
- Manually inserted Architectural clock gates
- Static SW control and Dynamic HW control of clock gating.
- Custom Raw Clock Tree Routing
- Low Power CTS with Qualcomm Custom Clock Tree cells.
- 24 Analog and Pad power domains
- 2 Digital Power domains
- Independent Voltage Scaling
– Active and Sleep modes
- Power Collapsing
- 8 Digital Power Islands with Power Gating
- All Low Power Features fully Verified
- Power Aware simulation
- Power Structural Checks
Serra Floorplan
Serra Static IR Drop Map
Serra Dynamic IR Drop Map
DESIGN AUTOMATION
- Design Automation is Mandatory
- Design complexity
- Time to Market is Critical
- Fewer design resources required
- Quality
- Through Standardized flows and tools
- Automated Design tools and flows have several limitations that affect low power
- Many automated tools don’t consider power
- Others don’t make the correct tradeoffs between power and area/timing.
- This Presentation focuses on the tradeoffs involved with several low power techniques used on Serra and the limitations of automated design for low power
Design Automation
Customized Design for Low Power
- Custom design flows and circuits can produce better results
- Lower Power, Higher Speeds, Less Area
- Custom design requires more design effort and time
- Use customized Design and signoff ONLY in critical areas
- Pick areas of customization to get the greatest benefit
- Clock Trees
- Raw Clock Trees
- Raw Clock Dividers
- Memory IP
- Standard Cell
- Move the customization into IP
- Use automation to insert the IP, check the IP and optimize with IP.
Customization of Raw Clock Network
- Raw clock networks are high speed, high power nets from PLLs to dividers
- Raw clock dividers are stacked and custom routed.
- Width and spacing are chosen for optimal clock isolation while maintaining fast transition times.
- Use minimal clock buffers to distribute clocks within the network but maintain desired transition delay.
- 10-input tri-state mux
- Reduces insertion delay and power
- Custom Layout of raw dividers
- Reduces critical path delay, voltage noise and optimizes rise/fall times.
- ~4x reduction in Raw clock Power (Compared to Previous chip)
[Figure: Raw Clock Network, shown against a traditional wide mux structure fed by PLLs; selected clock path in green, non-selected active clocks in red]
LOW POWER IP
Node | Peripheral with footer | Bit-cell array with header | Sleep with data retention | Sleep without data retention
90nm | Yes | No | Yes | No
65nm | Yes | No | Yes | No
45nm | Yes | Yes | Yes | Yes
[Figure: leakage at 90nm, 65nm and 45nm, split into array and periphery, in Function, Sleep w/ retention and Sleep w/o retention modes, with and without Vdd scaling]
- Bit-cell leakage is up 6X in 45nm.
- No hVt devices.
- All memories need to have leakage control
- Circuit + System solutions
- Peripheral footer
- Bit cell header
- Vdd scaling
Maintain only the useful data with array header and reduce Vdd during sleep mode to manage the leakage.
Low Power Memory
Memory Partial Bank Collapse
- Power Gating portions of the bit-cell array that are not needed
- Standby/Active Leakage reduction
- Some active power reduction since clock/data is gated to banks that are not accessed.
- Requires Proper memory management in SW and FW.
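The SW/FW memory management requirement amounts to knowing which banks still hold live data. The following is a minimal sketch of that bookkeeping, assuming a flat address space split into equal-size banks; the function names, the (start, length) range format and the bank geometry are hypothetical and not taken from the Serra memory design.

```python
def banks_to_keep(live_ranges, bank_size, num_banks):
    """Return the set of bank indices that hold live data and must stay powered."""
    keep = set()
    for start, length in live_ranges:
        if length <= 0:
            continue
        first = start // bank_size
        last = (start + length - 1) // bank_size
        if last >= num_banks:
            raise ValueError("range extends past the end of the memory")
        keep.update(range(first, last + 1))
    return keep


def collapsible_banks(live_ranges, bank_size, num_banks):
    """Banks with no live data, i.e. candidates for partial bank collapse."""
    keep = banks_to_keep(live_ranges, bank_size, num_banks)
    return sorted(set(range(num_banks)) - keep)


# Hypothetical 8 x 1 KB memory with two live buffers.
free = collapsible_banks([(0, 100), (4096, 10)], bank_size=1024, num_banks=8)
```

Packing live buffers into as few banks as possible increases the number of banks that can be collapsed, which is why allocation policy matters here.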
STANDARD CELL POWER REDUCTION
Standard Cell Leakage
- 45nm Standard Cell Challenges
- Ioff increase and no HVT device compared to 65nm
- Performance provided by NVT is not required everywhere
- Power Gating not possible in all blocks
- 45nm OPTIONS
- Use Longer Channel length NVT device
- Min channel length is 40nm in 45nm tech
- Use Stacked NVT devices
- Replace every device with a stack of 2 devices
Length (L) | Spacing (S) | Pitch | Increase
40n | 40n | 180n | -
50n | 45n | 200n | 11%
60n | 45n | 210n | 17%
Pitch = L/2 + S + 60n + S + L/2
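The pitch relation can be checked against the table values directly. A minimal sketch: the function and parameter names are assumptions, and the 60n term is kept as an opaque constant because the slide does not say what it represents.

```python
def poly_pitch_nm(length_nm: float, spacing_nm: float, fixed_nm: float = 60.0) -> float:
    """Pitch = L/2 + S + 60n + S + L/2, all values in nanometers."""
    return length_nm / 2 + spacing_nm + fixed_nm + spacing_nm + length_nm / 2


base = poly_pitch_nm(40, 40)  # minimum channel length case
for length, spacing in [(40, 40), (50, 45), (60, 45)]:
    pitch = poly_pitch_nm(length, spacing)
    increase = 100 * (pitch / base - 1)
    print(f"L={length}n S={spacing}n pitch={pitch:.0f}n increase={increase:.0f}%")
```

The computed pitches (180n, 200n, 210n) and increases (11%, 17%) match the table.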
QCT45
Leakage Scaling Factor, 65nm to 45nm (TT, 25C)
Device | H2N | H2NL | N2N | L2L | L2N
N | 24.71 | 9.45 | 1.61 | 10.48 | 0.56
P | 27.58 | 5.68 | 3.00 | 26.80 | 1.51
Ave | 26.14 | 7.57 | 2.31 | 18.64 | 1.03
Simulation Results
[Figure: four bar charts comparing the candidates NVT 40n, NVT 50n, NVT 60n, NVT 70n and Stacked NVT: Leakage Savings (factor), Delay Increase (factor), Area Increase (%, Area: Comb.) and Switching Cap Increase (%). Conditions: TSMC TT, 1.1V, 25C. Some data is marked "Not Available".]
Standard Cell Leakage
- Longer Channel length
- Leakage savings = 4.5X (L=50nm), 6.6X (L=60nm)
- Delay = 1.18X (L=50nm), 1.32X (L=60nm)
- Area increase ~ 5.8% (L=50nm), 6.7% (L=60nm)
- Switching Cap increase ~ 16% (L=50nm), 32% (L=60nm)
- Stacking is NOT as beneficial
- 8.3X Leakage Reduction
- Steep Delay Penalty 2.9X
- Substantial area penalty
- Switching cap increase ~ 62%
- Conclusion
- L=40 to L=70 was evaluated
- Diminishing returns in leakage savings as L increases
- Dynamic power versus Leakage tradeoff
- Provide NVT and LVT with two channel lengths in 45nm Library
- Nvt and PNvt (long channel)
- Lvt and PLvt (long channel)
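The "dynamic power versus leakage tradeoff" can be made concrete with a rough model. This sketch takes the leakage savings and switching cap figures from the slide and assumes, as a simplification not stated in the source, that the dynamic power of a swapped cell grows in proportion to its switching cap increase. The leakage fraction of total power is a hypothetical input, and delay is ignored.

```python
CANDIDATES = {
    # name: (leakage savings factor, switching cap increase as a fraction)
    "NVT L=50nm": (4.5, 0.16),
    "NVT L=60nm": (6.6, 0.32),
    "Stacked NVT": (8.3, 0.62),
}


def relative_power(leak_fraction: float, savings: float, cap_increase: float) -> float:
    """Total power relative to the L=40nm baseline, whose total is 1.0."""
    dynamic = (1.0 - leak_fraction) * (1.0 + cap_increase)
    leakage = leak_fraction / savings
    return dynamic + leakage


def break_even_leak_fraction(savings: float, cap_increase: float) -> float:
    """Leakage share of baseline power above which the swap lowers total power."""
    return cap_increase / (1.0 + cap_increase - 1.0 / savings)


thresholds = {name: break_even_leak_fraction(s, c) for name, (s, c) in CANDIDATES.items()}
```

Under these assumptions the L=50nm cell only pays off where leakage is more than about 17% of the cell's baseline power, which fits the slide's point that long-channel cells belong in low-activity logic rather than everywhere.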
Block Level Leakage Comparison
- 65nm vs. 45nm (Serra)
- Compare the Leakage of 6 Hard Macros with the fewest RTL changes.
- 65nm vs. 45nm Library
- 1.23v vs. 1.125v
- Higher Speed blocks (B, E, F)
- Benefited from increased speeds in 45nm
- Standard Cell Leakage actually decreased
- Medium/Slow speed blocks (A, C, D)
- Increase in standard cell leakage from 30-66%.
- At the block level the ave. leakage increase from 65nm -> 45nm is 1.5x to 2x
[Figure: bar chart of standard cell leakage (0 to 50 uW) for BLK A to BLK F, 65nm vs. 45nm]
Block | 65nm | 45nm | Diff
BLK A | 19.10 uW | 24.70 uW | 29.3%
BLK B | 0.39 uW | 0.18 uW | -54.4%
BLK C | 15.50 uW | 20.20 uW | 30.3%
BLK D | 25.50 uW | 42.40 uW | 66.3%
BLK E | 0.93 uW | 0.14 uW | -84.6%
BLK F | 46.30 uW | 32.00 uW | -30.9%
MTCMOS
MTCMOS - Strategy
- Mixed Vt Libraries are used to optimize both leakage and active power
- Higher Vt reduces leakage
- Use Long-channel devices to provide additional leakage reduction
- Small area and active power penalty
- Lower Vt can improve active power
- providing better Isat / Cg ratio
- Lower Vt has better relative performance and less delay variation at Low Vdd.
- Applications of Low Vt
- High Activity Nets
- Exclusively for clock trees.
- Less insertion delay, skew, variation, optimal repeater.
- Decreased active power at the cost of leakage.
- In blocks that are power gated
- In power domains with aggressive voltage scaling
- In high performance blocks.
- Most EDA tools do a poor job of active power optimization with mixed Vt Libraries
- EDA tools DO use Mixed Vt for timing.
- Don’t know where to use Low Vt for active power reduction.
- Need usage information to trade off leakage and active power
- High activity nets – Requires Vector Information
- Clock Nets – Requires Constraints to restrict cell list
- Blocks that are power gated – Power Intent
- Blocks that are Voltage scaled – Power Intent
- Solution
- Users decide which Vt is most appropriate.
- Users constrain or restrict the tool’s choice of Vt.
- Use single Vt for initial runs and multi-Vt for incremental optimization.
MTCMOS – EDA Limitations
MTCMOS – EDA Limitations
- Many opportunities for leakage recovery
- Timing slack on paths
- Excess margin early in the flow
- Pessimistic view of timing
- Optimization cost functions favor area over leakage
- Point tools in the industry that were created for this purpose.
- Special scripts in Timing Signoff tool.
- Leakage Recovery within PTSI™
- Knowing where to use High Vt for leakage power reduction
Zhan, “A Utility for Leakage Power Recovery within PrimeTime SI” SNUG Boston 2008
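Leakage recovery spends leftover timing slack on higher-Vt cells. The sketch below is a generic greedy heuristic on a single path, not the algorithm of the cited PrimeTime SI utility; the cell records, field names and units are hypothetical. A real flow has to account for cells shared between paths and re-time after each swap.

```python
def recover_leakage(cells, slack_ps):
    """Greedily swap cells to a higher-Vt variant while slack stays non-negative.

    cells: list of dicts with 'name', 'delay_penalty_ps' (> 0) and 'leakage_saving_nw'.
    Returns (names of swapped cells, remaining slack in ps).
    """
    # Prefer swaps that save the most leakage per picosecond of added delay.
    order = sorted(
        cells,
        key=lambda c: c["leakage_saving_nw"] / c["delay_penalty_ps"],
        reverse=True,
    )
    swapped = []
    for cell in order:
        if cell["delay_penalty_ps"] <= slack_ps:
            swapped.append(cell["name"])
            slack_ps -= cell["delay_penalty_ps"]
    return swapped, slack_ps


path = [
    {"name": "u1", "delay_penalty_ps": 10, "leakage_saving_nw": 50},
    {"name": "u2", "delay_penalty_ps": 20, "leakage_saving_nw": 40},
    {"name": "u3", "delay_penalty_ps": 5, "leakage_saving_nw": 10},
]
swapped, remaining = recover_leakage(path, slack_ps=25)
```

Here u1 and u3 are swapped and u2 is left alone because its delay penalty no longer fits in the remaining slack.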
- Misc. Standard Cell Power
- Low Clock Power Flip Flops
- Additional Flip Flop Family that targets power reduction
- New Clock Topology
- Enables designers and tools to make a power vs. speed tradeoff
- 21% lower clock power, 6% smaller, but 36% slower than regular flop.
- Originally they had the same footprint
- EDA tools did not insert the Low Clock Power Flops unless they were also smaller. [Cost functions favor Area & Timing]
CLOCK TREE POWER
Clock Tree Power
- Clock Tree Power is still a major contributor to total active power (30-40%)
- Clock Architecture and Frequency Planning
- Clock Architecture has a huge impact on power
- Number of PLLs
- Independent PLLs
- Increase flexibility and ease of use.
- Optimizes Power for simple use cases
- Shared high frequency PLLs
- Lower system power in concurrency modes
- However, SW may not be able to manage very complex frequency plans which results in power inefficiencies
- Clock domain partitioning
- Independent Clock Domain Control (HW/SW)
- Frequency and Gating
- Clock Synchronicity
- Asynchronous domains are more flexible
- Increased latency across clock domain boundaries
- Synchronous vs. Iso-Synchronous
- Saves power in low performance modes. Costs power in the worst case
Clock Tree Architecture
- Still one of the most effective ways to save active power
- Maximize fine-grain clock gating
- Optimize settings for integrated clock gating (w/ PowerCompiler™)
– Almost always a win.
– Beware of increased skew, insertion delay, timing and area impact.
- Manually inserted Architectural CGCs
- Gate modules based on Modal and Temporal need.
- Can be HW or SW controlled
- Analyze the Clock Gating Percentage (CGP) and Clock Gating Efficiency (CGE) to find new opportunities
Clock Gating
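The slides name CGP and CGE without defining them. The sketch below uses one common reading as an assumption: CGP is the share of registers behind at least one clock gate, and CGE is the average share of clock cycles blocked before reaching a register. The record format and the `enable_duty` field are hypothetical.

```python
def clock_gating_percentage(registers):
    """Percent of registers that sit behind at least one clock gate."""
    gated = sum(1 for r in registers if r["gated"])
    return 100.0 * gated / len(registers)


def clock_gating_efficiency(registers):
    """Average percent of clock cycles blocked before reaching a register."""
    blocked = sum(1.0 - r["enable_duty"] for r in registers)
    return 100.0 * blocked / len(registers)


# enable_duty is the fraction of cycles in which the register still sees a clock edge.
regs = [
    {"gated": True, "enable_duty": 0.25},
    {"gated": True, "enable_duty": 0.5},
    {"gated": False, "enable_duty": 1.0},
    {"gated": True, "enable_duty": 1.0},  # gated, but the enable never goes low
]
cgp = clock_gating_percentage(regs)
cge = clock_gating_efficiency(regs)
```

The example gives a CGP of 75% but a CGE of only 31.25%, showing why a high gating percentage alone does not guarantee power savings; the last register is the kind of opportunity the analysis is meant to expose.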
Example - Clock Power
- Example of a Complex DMA engine with multiple bus masters.
- Separated Client Interface clocks
- Provided Independent clock frequency control
- Div 1-4
- Iso-sync / Async support on bridges
- Enables Master/Slave clock division
[Figure: clock tree structure. Previous: one TREE with ~1650 buffers and ~18655 registers. Also shown: five LEAF trees behind CXC blocks, with ~2947, ~3009, ~2197, ~1540 and ~8962 registers and ~1650/2, ~1650/8, ~1650/8, ~1650/8 and ~1650/8 buffers]
Bus Master Clock Power Comparison (mA)
- Baseline: 90nm 17, 65nm r1 8.0578, 65nm r2 5.7606, Serra 3.377
- CI0 0.2507, CI1 0.2307, CI2 0.2232, CI3 0.1979
Low Power CTS
- Clock Tree Synthesis (CTS)
- Used for the majority of ~250 clock trees on Serra
- Small changes to the constraints and cells can dramatically affect the power of CTS trees.
- Custom Clock Cell Design
- Clock Buffers, Inverters, Muxes, Gating Cells, Dividers.
- Internal clock routes are minimized to reduce clock power
- Insertion delays are minimized to reduce overall clock power and skew
- Low power CTS
- Higher granularity of clock buffers.
- Relaxed transition and skew constraints
- Modal skew balancing (functional vs. test)
Low Power CTS – EDA Deficiencies
- Most CTS tools only consider
- Global skew and transition time.
- CTS – Example
- 1) Balance Global Skew
- 2) 100ps Max Transition
- 3) Min Area/Power
- A. Default Settings
- B. Optimized Run
- Need tools which
- Consider Local vs. Global skew.
- Relax clock transition times
- Cbuf sizing on paths
- Cluster registers that share a clock or clock gate
- Insert, clone and de-clone clock gates
- Move clock gates up the tree
[Figure: CTS results for A (default settings) and B (optimized run)]
MULTIPLE POWER DOMAINS
Multiple Power Domains
- Power Domain Partitioning
- More power regimes from PMIC
- + Better Power Control and Efficiency
- + Independent Voltage Scaling and Power Collapsing
- - Increased PDN impedance and IR Drop
- - Increased Bill Of Materials (BOM)
- - Requires level shifters and resynchronization at boundaries.
- Need a small on-chip regulator with fast response and good efficiency
- IR Drop and PDN impedance
- - Increased Power Density
- - Increased metal resistance
- - Increased IR due to Power Switches
- - IR impact is greater at lower voltages
- - Dynamic IR effects on skew and timing are not well modeled.
PDN Impedance
- Worst Case IR drop may not be at max frequency and power
- An arbitrary PDN network observed from the die looking back towards the VRM is shown below. The PMIC has passive modeling.
- Resonance points in the RLC network.
[Figure: PDN impedance vs. frequency (10^3 to 10^9 Hz, impedance 0.01 to 1, log-log), Z_spec and Z_network. Annotations: potential peaking due to PMIC & bulk cap; board “mounting inductance”; mid-band region controlled by decap inventory]
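The resonance points mentioned above can be illustrated with a single-stage RLC model. This is a deliberately reduced sketch, not the network from the slide: one series R-L branch stands in for the supply path and one capacitor with ESR stands in for the decap, and every component value is hypothetical.

```python
import math


def pdn_impedance(freq_hz, r_series, l_series, c_decap, r_esr):
    """|Z| seen from the die: (R + jwL) in parallel with (ESR + 1/(jwC))."""
    w = 2 * math.pi * freq_hz
    z_supply = complex(r_series, w * l_series)
    z_decap = complex(r_esr, -1.0 / (w * c_decap))
    return abs(z_supply * z_decap / (z_supply + z_decap))


def find_peak(freqs, **rlc):
    """Frequency in the sweep with the highest impedance magnitude."""
    return max(freqs, key=lambda f: pdn_impedance(f, **rlc))


# Hypothetical values: 10 mohm, 1 nH, 100 nF, 5 mohm ESR.
RLC = dict(r_series=0.010, l_series=1e-9, c_decap=100e-9, r_esr=0.005)
sweep = [10 ** (3 + i * 0.01) for i in range(601)]  # 1 kHz to 1 GHz, log spaced
peak_hz = find_peak(sweep, **RLC)
f0 = 1.0 / (2 * math.pi * math.sqrt(RLC["l_series"] * RLC["c_decap"]))
```

The impedance peak sits near the L-C resonance (about 16 MHz for these values), far below the top of the sweep, which is one way to see why worst-case IR drop may not occur at maximum frequency.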
VOLTAGE SCALING
Static Voltage Scaling (ARM11)
- SW selects a performance level based on application and/or SW performance monitors.
- HW selects 1 of 8 pre-programmed Frequency-Voltage pairs
- Benefits
- Active power reduction
- 32%-36% based on lab measurements at 96 MHz
- Limited by Memory Vcc Min
- Easier to implement on Applications processor
- Drawbacks
- Requires separate voltage regulator (vdd_apc)
- Less efficient than AVS
- SW Controlled vs. HW controlled AVS
- SW Performance monitors and algorithms lagging
- Requires careful characterization of ARM11ss FMAX vs. Voltage
- Timing Closure (Corners, Variation, Margins)
[Figure: static voltage scaling block diagram. Labels: Serra Digital Die, ARM11SS, Modem ARM Control, ARM9, Peripheral Bridge and MPU, svs_cntl, FSM with lvl 1 to lvl 8, Current PLevel, Next PLevel, stat, req/ack, addr/data, SSBI, PMIC7500, vdd_apc, vdd_dig]
Freq. (MHz) / Vdd (V) lookup tables:
- 128 / 1.15, 96 / 1.10, 64 / 1.05, 48 / 1.00, 32 / 0.975, 25 / 0.95, 20 / 0.95
- 256 / 1.20, 196 / 1.15, 128 / 1.10, 96 / 1.05, 64 / 1.00, 32 / 0.975, 20 / 0.950
- 256 / 1.20, 196 / 1.15, 128 / 1.10, 96 / 1.05, 64 / 1.00, 32 / 0.95, 20 / 0.95
Static Voltage Scaling (MSM Top)
- Multiple Blocks in the same power regime
- The highest required voltage is determined from multiple LUTs.
- LUT data obtained from FMAX characterization.
- Software programs the voltage upon entry / exit of each mode
- Drawbacks
- SW Complexity
- Fixed frequency blocks don’t scale well
- Lots of characterization work
- Risk of Test escapes
- Not all blocks included in characterization.
[Figure: BLK A, BLK B and BLK C in one power regime; Mode1 requires 1.2V, Mode2 requires 1.15V, Mode3 requires 0.95V]
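Picking the regime voltage from several LUTs reduces to a lookup per block followed by a maximum. The sketch below reuses the three frequency/voltage tables shown on the slides; which block each table belongs to is not stated, so the names `LUT_1` to `LUT_3` and the function names are placeholders.

```python
LUT_1 = [(20, 0.95), (25, 0.95), (32, 0.975), (48, 1.00), (64, 1.05), (96, 1.10), (128, 1.15)]
LUT_2 = [(20, 0.950), (32, 0.975), (64, 1.00), (96, 1.05), (128, 1.10), (196, 1.15), (256, 1.20)]
LUT_3 = [(20, 0.95), (32, 0.95), (64, 1.00), (96, 1.05), (128, 1.10), (196, 1.15), (256, 1.20)]


def required_vdd(lut, freq_mhz):
    """Voltage of the lowest characterized frequency entry that covers the request."""
    for max_freq, vdd in sorted(lut):
        if freq_mhz <= max_freq:
            return vdd
    raise ValueError(f"{freq_mhz} MHz exceeds the characterized range")


def regime_vdd(requests):
    """requests: (lut, freq_mhz) for every block sharing the power regime."""
    return max(required_vdd(lut, freq) for lut, freq in requests)


# Hypothetical mode: three blocks at 48, 128 and 20 MHz share one rail.
vdd = regime_vdd([(LUT_1, 48), (LUT_2, 128), (LUT_3, 20)])
```

The rail ends up at 1.10 V because of the one block running at 128 MHz, while the other two would be satisfied with 1.00 V or less; this is the inefficiency behind the "fixed frequency blocks don't scale well" drawback.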
Process Monitoring DVS
- Process Monitoring DVS
- Increased process variation at 45nm, increased benefits of Process related Dynamic Voltage Scaling.
- Measure PM speed.
- SW obtains the required voltage from LUT.
- LUT is created through Characterization.
- Benefits
- 16% power reduction for TTT, 31% for FFF estimated.
- More practical than DVFS on top-level
- AVS is too complex to implement on top-level
- Drawbacks
- Impact on ATE and test time, binning and test escapes.
- Characterization effort
- Timing Closure (Corners, Margins, Variation)
[Figure: simulation data, Fmax vs. Vdd for FFF, TTT and SSS parts (Fast, Typical, Slow), with the design target frequency and pass/fail regions marked]
Process Bin | Required Vdd (V)
Fast | 0.950
Typ. | 1.025
Slow | 1.125
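The per-bin voltages give a quick feel for the size of the benefit. This sketch applies only the V^2 dependence of dynamic power, with the slow-bin voltage as the reference; the dictionary and function names are assumptions, and leakage is ignored.

```python
BIN_VDD = {"fast": 0.950, "typ": 1.025, "slow": 1.125}


def dynamic_power_saving(process_bin: str, worst_case_vdd: float = 1.125) -> float:
    """Fractional dynamic-power saving from V^2 scaling alone."""
    vdd = BIN_VDD[process_bin]
    return 1.0 - (vdd / worst_case_vdd) ** 2


savings = {name: dynamic_power_saving(name) for name in BIN_VDD}
```

V^2 scaling alone gives about 17% for the typical bin and about 29% for the fast bin, in the same range as the 16% (TTT) and 31% (FFF) estimates on the slide; the slide's figures presumably come from a fuller power model.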
- 45nm designs require MMMC signoff for Hold robustness
- Hold closure at several PVT corners.
- Functional Mode and Test Mode.
- MMMC Implementation is also necessary to achieve lowest power.
- Voltage Scaling:
- Optimize Implementation across several PVT corners to make the correct decisions
- MMMC Reduces iterations and effort for closing Setup & Hold.
- Power Optimization
- Power corners are different than timing corners.
– Leakage information is very inaccurate at Temperature/Voltage extremes.
– Optimize power for the nominal use modes.
Multi-Mode Multi-Corner
- MMMC Requirements
- Constraints for each Mode are needed early.
- More library characterization required.
- >15 Corners used on Serra
- MMMC Limitations
- Increased tool run times (Physical Design cycle time)
- Time to market
- No mature solution for MMMC throughout the flow
- Used single corner Synthesis and Placement with some iteration and over-margining on Serra.
- MMMC was used during CTS, Post-CTS opt and final timing closure.
- Evaluation of Commercial tools
– Evaluated industry leading tools for MMMC synthesis/placement after Serra. (Still Maturing)
Multi-Mode Multi-Corner
Variation and Margins
- Variation is increasing at smaller geometries and lower voltages
- Adding fixed margins is detrimental to power
- Increased hold buffer insertion
- Larger gate sizes
- Swapping to lower Vt when not needed.
- Use OCV to apply margin only where necessary.
Monte Carlo Simulations of Hold Time at 2 VT Corners
VDD Minimization During Sleep
[Figure: Estimated Leakage Power Reduction by VDD Minimization; scaled leakage power (%) vs. supply voltage (V), 0.2 V to 1.2 V]
- Minimize voltage of entire digital die during sleep
- Especially beneficial during short sleep cycles (BT Scan, QChat)
- Estimated 70% leakage power reduction @ Battery by lowering the digital voltage from 1.1V to 0.7V
- Minimal SW overhead and fast restoration.
- All memory and register state is retained.
Simulated [PNVt TT, 25C]
VDD Minimization (Serra)
- Serra Leakage current during VDD MINIMIZATION
- Calculated 64% savings at Battery
- 80% efficiency, 3.7V Battery.
- Voltage minimization is the most effective way to control digital leakage without complicated SW changes.
$$I_{batt} = \frac{V_{msm} \cdot I_{msm}}{0.8 \cdot V_{batt}}$$
Serra (Normalized) Off Leakage, measured on 3 TT parts, 25C (chart legend: measured, calculated)
Voltage | Ave @ MSM | Ave @ Battery
1.078 | 1.00 | 0.36
0.81 | 0.47 | 0.13
Savings | 52.66% | 64.43%
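The battery-side numbers follow from referring the MSM current through the regulator at 80% efficiency and a 3.7 V battery. A minimal sketch using the normalized values from the table; the function and variable names are assumptions.

```python
def battery_current(i_msm: float, v_msm: float, v_batt: float = 3.7, efficiency: float = 0.8) -> float:
    """Current drawn from the battery for a given MSM current and rail voltage."""
    return (v_msm * i_msm) / (efficiency * v_batt)


nominal = battery_current(1.00, 1.078)   # normalized leakage at 1.078 V
minimized = battery_current(0.47, 0.81)  # normalized leakage at 0.81 V
savings = 1.0 - minimized / nominal
```

The results round to 0.36 and 0.13, matching the "Ave @ Battery" column. Savings computed from the rounded table inputs come out at about 64.7%, close to the 64.43% on the slide, which was presumably derived from unrounded measurements. The saving at the battery exceeds the saving at the MSM because the lower rail voltage reduces power as well as current.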
Full Chip Power Collapse
- Full Chip collapse at PMIC regulator
- Requires data to be saved and re-boot of all processors
- Power Collapse Break-Even Point
Energy Saved (Leakage * Time) > Energy Overhead (Save/Restore);
- For 2.56 Sec Slot cycle (Assuming 5mJ of Overhead) Power Collapse is beneficial if leakage is greater than 1.8mA
[Figure: "Which consumes less energy? w/ or w/o PC": break-even curves of leakage current (mA, 1 to 5) vs. sleep time (sec, 1 to 5) for PC/OH = 0.5mJ, 1mJ, 2mJ, 5mJ and 10mJ, with the PCH time slot (2.56 sec) marked. Scaled/Estimated Serra Current (mA) labels: 0.68, 34, 76; further labels: 2,477, 2,508, 2,560]
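The break-even condition, energy saved (leakage x time) greater than save/restore overhead, can be solved for the leakage current. The slide does not state the rail voltage used; the sketch assumes 1.1 V, which reproduces the quoted threshold of roughly 1.8 mA. Function names are assumptions.

```python
def break_even_leakage_ma(overhead_mj: float, sleep_s: float, vdd: float) -> float:
    """Leakage current (mA) above which power collapse saves energy in one sleep period."""
    return overhead_mj / (vdd * sleep_s)  # mJ / (V * s) = mA


def collapse_is_beneficial(leakage_ma: float, overhead_mj: float, sleep_s: float, vdd: float) -> bool:
    """True when leakage energy over the sleep period exceeds the collapse overhead."""
    return leakage_ma * vdd * sleep_s > overhead_mj


threshold = break_even_leakage_ma(overhead_mj=5.0, sleep_s=2.56, vdd=1.1)
```

For a 2.56 s slot cycle and 5 mJ of overhead the threshold is about 1.78 mA. Shorter sleep periods or larger overheads push the threshold up, which is why voltage minimization is preferred for short sleep cycles.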
POWER GATING
FS/HS Power Gating
- Headswitch (HS) and footswitch (FS)
- Serially adding a high resistance device between power rails in low leakage mode
- Power Switch design is a trade-off between leakage power, IR drop (when FS/HS are on), area, ramp-up noise and time.
- Local FS is integrated within logic circuit
- Used at Qualcomm since 0.13um
- Minimum impact on design automation flow
- Finer control to turn down logic circuits independently
- State Maintained
- (Leakage savings / Cost) is low.
- Global FS/HS are integrated within power grid
- Switches shared by all logic in a domain, thus smaller size and lower IR drop is possible.
- Can be implemented with finer grid, e.g. global distributed FS/HS (GDFS), or coarse grid, e.g. FS/HS power ring
- GDFS is used in 90nm and 65nm designs
- Employ a save & restore scheme
[Figure: local footswitch between the logic and Vss; global FS between Vss_int and Vss]
Global FS/HS implementations
- FS/HS ring
- Less PD effort
- Shorter sleep control distribution
- Larger IR drop compared with GDFS, especially in flipchip case
- IR drop increases more quickly when the size of the block increases (cubic w.r.t. the length)
- Only Suitable for small size macros
- Global distributed FS/HS
- Can be modeled as an additional resistance between global and local power mesh
- Does not break global mesh
- Needs sleep control signal distribution throughout.
- Suitable for large size macros
[Figure: FS ring vs. GDFS between the global PG mesh and local PG mesh; footswitch cell with transistors Mf and Mr between vssfx and Vss, with enable chains En_few_in/En_few_out and En_rest_in/En_rest_out]
Serra GDFS Design
- Global Distributed Footswitch (GDFS) was chosen for leakage reduction vs. area cost
- Sleep leakage control
- Extra long channel NVT transistors used for footswitch cells
- Active leakage control
- Turn off macros that are not used in certain operating modes
- High Temperature active leakage control
GDFS Design Flow
- Use QCOM in-house tools for inserting isolation cells.
- Use existing commercial tools to insert switches and connect enables.
- Use existing commercial tools + scripts to insert isolation cells during Physical Design.
- Pre-place GDFS cells and do IR drop analysis at early stage
- Further refine the size of GDFS cells according to local IR drop
Low Power Verification
- Power Aware Verification is Required
- Verify entry to & exit from low power states
- Properly model power collapse
- Verify clamp polarity
- Power Aware Simulation tools
- Qualcomm Scripts
- Commercial Tools
- Power Structural Checks
- Verify Power domain crossings
– Isolation cells
- Commercial Tools @ 3 design stages
– RTL
– Logical Gate
– Physical Gate
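A power structural check on domain crossings can be pictured as a scan over the crossing list. The sketch below is a toy model, not the behavior of any commercial checker: the crossing records, domain names and the `always_on` set are hypothetical, and it only flags signals that leave a collapsible domain without an isolation cell.

```python
def missing_isolation(crossings, always_on=frozenset()):
    """Return nets that leave a collapsible power domain without an isolation cell."""
    return [
        c["net"]
        for c in crossings
        if c["src"] != c["dst"]            # a real domain crossing
        and c["src"] not in always_on      # driver can be powered off
        and not c["isolated"]
    ]


crossings = [
    {"net": "gfx_done", "src": "pd_gfx", "dst": "pd_top", "isolated": True},
    {"net": "gfx_irq", "src": "pd_gfx", "dst": "pd_top", "isolated": False},
    {"net": "gfx_start", "src": "pd_top", "dst": "pd_gfx", "isolated": False},
]
violations = missing_isolation(crossings, always_on={"pd_top"})
```

Only `gfx_irq` is reported: it is driven from a domain that can collapse and has no clamp. A fuller check would also verify clamp polarity, as the slide notes.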
[Figure: Total Leakage Current by State (Active, Idle, Dormant, Power Off), stacked by gfs, mem, std, N2, T2 and T1; TTT, 1.125V, 25C]
Serra Leakage
- Leakage by State
- Measurements on 3 TT Parts
- 4x reduction in leakage from Active to Power off states. (Excluding Vdd min)
- >4x leakage reduction from GDFS.
- Power Off Variation
- Mean-to-mean variation across 3 typical wafers is 20%
- Variation across dice from the same wafer is ~100%
Columns – Estimated, Dots – Measured
CONCLUSIONS & FUTURE DIRECTIONS
Conclusions
- Power IS a key differentiator and limits the design
- Tools and users must Design for Power
- Power is a different beast than area/timing
- Power/Energy is highly dependent on use case (Vector)
- Must consider Static and Dynamic Power
- Power varies across PVT.
– Optimization at Nominal
– Constrained by worst case.
- Designing for Low Power is all about Tradeoffs.
- Many Low Power Techniques exist
- Each needs to be applied with tradeoffs in mind.
- Know the tradeoffs and what your specific goals are.
- We still need more techniques to meet the customer demands. (We are not doing enough)
Conclusions
- Design automation is required for Complex SOCs, but Customization in select areas can Dramatically reduce power.
- Move Customization into IP
- Several deficiencies and limitations exist within EDA tools.
- Multi-Mode Multi-Corner Design is a must
- EDA tools need to improve and move MMMC up in the flow
- Timing and power corners are not the same.
- Simultaneous optimization of timing, power and area needed at all stages
- High Level Synthesis
- RTL Design & Optimization
- Logic Synthesis
- Clock Tree Synthesis
- Place and Route
- Timing Closure
- Getting the Architecture right is critical to power
- High Level System Power Modeling required for making correct architectural design decisions
- Need to have “relatively” accurate power models to make tradeoffs
- Power Vectors
- Accurate power estimation and optimization require accurate activity information
- There is a need for tools/methods which extract activity information from real SW code running on Emulation and/or system models.
- Benchmarks and verification type vectors are often quite different from “real” world use cases
- Vector-less power estimation and optimization is a very challenging problem.
- Some have attempted, but no satisfactory solution seen thus far.
Conclusions
References
[1] http://www.nttdocomo.co.jp/english/info/notice/page/090227_00.html
[2] Bruce Zhan, “A Utility for Leakage Power Recovery within PrimeTime SI”, SNUG Boston, 2008
[3] Krzysztof A. Kozminski, “Optimization for Leakage Power with PrimeTime”, SNUG San Jose, 2004