Models and Metrics for Energy-Efficient Computer Systems
Suzanne Rivoire May 22, 2007 Ph.D. Defense EE Department, Stanford University
Power and Energy Concerns
Processors: power density
[Borkar, Intel]
Personal computers
Servers and data centers
Compare energy efficiency
Identify / motivate new designs
Understand how high-level properties affect power
Improve power-aware scheduling policies / usage
First complete, full-system energy-efficiency benchmark
Design of winning system
Generates family of high-level full-system models
Generic, accurate, portable
Workload, metric, guidelines
Rationale and pitfalls
3.5x better than previous best
Insights for future designs
[S. Rivoire, M. A. Shah, P. Ranganathan, C. Kozyrakis, “JouleSort: A Balanced Energy-Efficiency Benchmark,” SIGMOD 2007.]
Under-specified or “under construction”
Limited to a particular component or domain
Exercises all core components: CPU, memory, disk, I/O, OS, filesystem
End-to-end measure of improvement
PDAs, laptops, desktops, supercomputers
Supercomputers to clusters, GPU?
MinuteSort: How many records sorted in 1 min?
Terabyte: How much time to sort 1 TB?
PennySort: How many records sorted for $0.01?
Performance-Price: MinuteSort/$$
More info at http://research.microsoft.com/barc/SortBenchmark/
Equally (energy)?
Energy (Joules) = Power (Watts) × Time (sec.)
Privilege performance (energy-delay product)?
Fix energy budget and compare records sorted?
Fix num. records and compare energy?
Fix time budget and compare records/Joule?
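The candidate metrics above can be compared on a pair of runs. The sketch below is illustrative only: the function names and every power, time, and record figure are invented assumptions, not measurements from the talk.

```python
def energy_joules(avg_power_watts, time_sec):
    """Energy (J) = average power (W) x time (s)."""
    return avg_power_watts * time_sec


def energy_delay_product(avg_power_watts, time_sec):
    """Energy-delay product (J*s): penalizes slow runs more than energy alone."""
    return energy_joules(avg_power_watts, time_sec) * time_sec


def records_per_joule(records, avg_power_watts, time_sec):
    """Records sorted per Joule of energy consumed."""
    return records / energy_joules(avg_power_watts, time_sec)


# Two hypothetical systems sorting the same input (numbers are made up).
RECORDS = 100_000_000
systems = {
    "fast_hot": {"power": 400.0, "time": 100.0},
    "slow_cool": {"power": 50.0, "time": 400.0},
}

for name, s in systems.items():
    e = energy_joules(s["power"], s["time"])
    edp = energy_delay_product(s["power"], s["time"])
    rpj = records_per_joule(RECORDS, s["power"], s["time"])
    print(f"{name}: {e:.0f} J, EDP {edp:.0f} J*s, {rpj:.0f} records/J")
```

With these made-up numbers, weighting energy alone favors the slower system, while the energy-delay product favors the faster one, which is the trade-off the questions above are probing.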
[Figure: SortedRecs/Joule (SRecs/J), 2000 to 18000, versus records sorted, 1.0E+05 to 1.0E+10]
1-pass sort < 10 sec (N lg N) complexity SortedRecs/Joule
Measurement setup: wall AC power feeds a power meter, which supplies the sorting system.
A monitoring system collects power readings from the meter (serial cable) and sort timing from the sorting system (network).
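One way to turn the two data streams into a score is to integrate the power readings over the sort interval and divide the record count by the result. This is a minimal sketch under assumptions: the function names, the trapezoidal integration, and the sample readings are all hypothetical, not the talk's measurement scripts.

```python
def energy_from_samples(samples, start, end):
    """Integrate power (W) over [start, end] seconds with the trapezoid rule.

    samples: list of (timestamp_sec, watts), sorted by time and covering
    the whole interval [start, end].
    """
    def power_at(t):
        # Linear interpolation between the two neighbouring meter readings.
        for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
            if t0 <= t <= t1:
                if t1 == t0:
                    return p0
                return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
        raise ValueError("time outside the sampled range")

    points = [(start, power_at(start))]
    points += [(t, p) for t, p in samples if start < t < end]
    points.append((end, power_at(end)))
    return sum((t1 - t0) * (p0 + p1) / 2.0
               for (t0, p0), (t1, p1) in zip(points, points[1:]))


def sorted_recs_per_joule(records, joules):
    """Records sorted per Joule of wall-socket energy."""
    return records / joules


# Made-up meter readings and sort window, for illustration only.
readings = [(0.0, 100.0), (1.0, 100.0), (2.0, 120.0), (3.0, 120.0)]
joules = energy_from_samples(readings, start=0.5, end=2.5)
score = sorted_recs_per_joule(1_100_000, joules)
print(joules, score)
```

Clipping the integral to the sort's start and end times keeps idle time before and after the run out of the energy total.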
System                    Disks  CPU %  SRecs  SRecs/J
GPUTeraSort (estimated)   9      n/a    59GB   ~3200
Laptop                    1      1%     10GB   ~3400
Commodity fileserver      12     >90%   10GB   ~3800
Low-end server            2      26%    10GB   ~1200
Blade                     1      11%    5GB    ~300
Pwr (W) values (row order not preserved): 406, 22, 140, 90, 290
Fileserver: sort BW 313 MB/s, 65W (peak)
CoolSort: sort BW 236 MB/s, 34W (peak)
CoolSort vs. fileserver: 52% power, 75% perf
Fileserver: Seagate Barracuda, 13W
Our winner: Hitachi Travelstar, 2W
Our winner vs. fileserver: 15% power, 50% perf
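The percentages quoted above follow directly from the peak-power and bandwidth figures. A short check, with a helper name (`ratio_percent`) that is purely illustrative:

```python
def ratio_percent(ours, baseline):
    """Express `ours` as a percentage of `baseline`."""
    return 100.0 * ours / baseline


cpu_power_pct = ratio_percent(34, 65)     # CoolSort vs. fileserver, peak W
sort_bw_pct = ratio_percent(236, 313)     # CoolSort vs. fileserver, MB/s
disk_power_pct = ratio_percent(2, 13)     # Travelstar vs. Barracuda, W

# Sort bandwidth per peak Watt, CoolSort relative to the fileserver.
bw_per_watt_gain = (236 / 34) / (313 / 65)

print(round(cpu_power_pct), round(sort_bw_pct), round(disk_power_pct))
print(round(bw_per_watt_gain, 2))
```

Giving up a quarter of the sort bandwidth for roughly half the peak power is a net gain in bandwidth per Watt for that component.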
Asus motherboard: mobile CPU + 2 PCI-e slots
RocketRAID disk controllers
13 Hitachi Travelstar 160GB
[Figure: SortedRecs/Joule, 2000 to 12000, and SortedRecs/sec (x 10E4), 20 to 140, versus disks used, 2 to 13. Labels: SRecs/J, Perf, GPUTeraSort]
Low-hanging fruit: use low-power HW
Best power-performance trade-off
Still need to fully utilize resources
Challenge: adequate interfaces and “glue” to bring laptop components into servers
Scaledown efficiency
Limited dynamic range
For fixed HW: peak efficiency = peak performance
How can we design machines that perform equally well in different benchmark classes?
How efficient is system at 50% utilization? 20%?
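A rough way to see why limited dynamic range hurts at partial load: if power falls only part of the way to zero when the system idles, work per Joule drops quickly as utilization drops. The sketch below assumes a linear power curve and throughput proportional to utilization; the 60W idle and 100W peak figures are invented for illustration.

```python
def power_watts(util, idle_w, peak_w):
    """Hypothetical linear power curve between idle and peak power."""
    return idle_w + (peak_w - idle_w) * util


def relative_efficiency(util, idle_w, peak_w):
    """Work per Joule at `util`, normalized to work per Joule at 100% load.

    Assumes throughput is proportional to utilization.
    """
    work_per_joule = util / power_watts(util, idle_w, peak_w)
    peak_work_per_joule = 1.0 / peak_w
    return work_per_joule / peak_work_per_joule


for u in (1.0, 0.5, 0.2):
    eff = relative_efficiency(u, idle_w=60.0, peak_w=100.0)
    print(f"utilization {u:.0%}: {eff:.2f} of peak efficiency")
```

Under these assumptions, efficiency is highest at full load, which matches the observation that for fixed hardware peak efficiency coincides with peak performance.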
JouleSort and PennySort give pieces of the answer
Need energy-efficiency benchmark
JouleSort specification
CoolSort system
Part of the sort benchmark suite
How do design decisions affect power?
How do my usage patterns affect power?
How will workload distribution decisions affect power?
Non-intrusive and low-overhead
Easy to develop and use
Fast enough for online use
Reasonably accurate (within 10%)
Inexpensive
Generic and portable
Simulation-based
Hardware metric-based
Inexpensive, arbitrarily accurate
Not full-system
Slow (not real-time)
Not portable
Input:
Simulation Output: Predicted power (component)
Highly accurate
Not full-system
Complex, require specialized knowledge
Not portable
Input:
Equation Output: Predicted power (component)
[Contreras and Martonosi, ISLPED 2005] [Isci and Martonosi, MICRO 2003]
How accurate?
How portable?
Tradeoff between model parameters/complexity and accuracy?
Input: Common util. metrics Equation Output: Predicted power (system)
Run one-time calibration scheme (possibly at vendor)
Stress individual components: CPU, memory, disk
Outputs: time-stamped performance metrics & AC power measurements
Fit model parameters to calibration data
Use model to predict power
Inputs: performance metrics at each time t
Output: estimation of AC power at each time t
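The calibrate-then-fit flow above can be sketched for the simplest case, a model that is linear in CPU utilization. This is a minimal illustration, not the talk's tooling: the function names and calibration numbers are invented, and ordinary least squares is assumed as the fitting method.

```python
def fit_linear(utils, watts):
    """Least-squares fit of watts = c0 + c1 * util (closed form)."""
    n = len(utils)
    mean_u = sum(utils) / n
    mean_p = sum(watts) / n
    sxx = sum((u - mean_u) ** 2 for u in utils)
    if sxx == 0:
        raise ValueError("calibration run must vary utilization")
    sxy = sum((u - mean_u) * (p - mean_p) for u, p in zip(utils, watts))
    c1 = sxy / sxx
    c0 = mean_p - c1 * mean_u
    return c0, c1


def predict_power(c0, c1, util):
    """Predicted AC power (W) for one utilization sample."""
    return c0 + c1 * util


# Hypothetical calibration data: CPU utilization (0..1) and measured AC power.
calib_utils = [0.0, 0.25, 0.5, 0.75, 1.0]
calib_watts = [100.0, 112.0, 126.0, 137.0, 150.0]

c0, c1 = fit_linear(calib_utils, calib_watts)

# After calibration, the power meter is no longer needed:
# each new utilization sample yields a power estimate.
estimates = [predict_power(c0, c1, u) for u in (0.1, 0.6, 0.9)]
print(c0, c1, estimates)
```

The calibration step is the only one that needs a power meter, which is why it can be run once, possibly at the vendor.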
Input: CPU util. % Equation Output: Predicted power (system)
[Fan et al, ISCA 2007]
$P = C_0 + C_1 u + C_2 u^r$
$P = C_0 + C_1 u$
Input:
Equation Output: Predicted power (system)
$P = C_0 + C_1 u_{CPU} + C_2 u_{disk}$
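Fitting a model with CPU and disk utilization terms needs a multi-variable least-squares solve. The sketch below assumes the model has the form P = C0 + C1*u_CPU + C2*u_disk, with C0 as an assumed constant term, and solves the normal equations directly. All names and sample values are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: calibration data lacks variety")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_cpu_disk(samples):
    """samples: list of (u_cpu, u_disk, watts). Returns [C0, C1, C2]."""
    rows = [(1.0, uc, ud) for uc, ud, _ in samples]
    ys = [p for _, _, p in samples]
    # Normal equations: (A^T A) c = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(ata, aty)


# Made-up calibration samples that stress CPU and disk separately and together.
samples = [
    (0.0, 0.0, 100.0),
    (1.0, 0.0, 140.0),
    (0.0, 1.0, 110.0),
    (1.0, 1.0, 150.0),
    (0.5, 0.5, 125.0),
]
coeffs = fit_cpu_disk(samples)
print(coeffs)
```

Stressing each component on its own during calibration keeps the utilization columns from being collinear, which is what makes the coefficients identifiable.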
[Heath et al, PPoPP 2005]
Input:
Equation Output: Predicted power (system)
$P = C_0 + C_1 u_{CPU} + C_2 u_{disk} + \sum_i C_i p_i$ ($p_i$: performance counter $i$)
Memory bus transactions
Unhalted CPU clock cycles
Instructions retired/ILP
Last-level cache references
Floating-point instructions
Highest and lowest frequencies
Laptop: gcc and gromacs only
ClamAV
Nsort (CoolSort-13 only)
SPECweb (Itanium only)
Performance counter model is most accurate across the board.
Any model is more accurate than none, and more detail/complexity is better than less.
Simple linear CPU-util. model gets within 10%, with some exceptions.
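Checking a model against the 10% target requires an error metric. The talk does not spell out which one was used, so the sketch below assumes mean absolute percentage error over time-stamped samples; the function names and sample values are hypothetical.

```python
def mean_abs_pct_error(predicted, measured):
    """Mean of |predicted - measured| / measured, as a percentage."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty series of equal length")
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


def within_target(predicted, measured, target_pct=10.0):
    """True if the model's mean error meets the accuracy target."""
    return mean_abs_pct_error(predicted, measured) <= target_pct


# Made-up predicted and measured AC power samples (W).
predicted = [110.0, 190.0]
measured = [100.0, 200.0]
error_pct = mean_abs_pct_error(predicted, measured)
print(error_pct, within_target(predicted, measured))
```

A mean can hide individual samples that miss badly, so a worst-case or percentile error is worth reporting alongside it when comparing models.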
(Xeon server)
Useful to model shared resources and bottlenecks
(Xeon server and CoolSort-13)
Necessary when dynamic memory power is high
Useful to tell how CPU is being utilized
Generic approach to power modeling yields accurate results
Linear CPU util. model not enough for…
GPUs
Network (not a factor today)
Combine exponential CPU model w/ perfctrs?
Cooling?
JouleSort energy-efficiency benchmark specification
Winning JouleSort machine
Simple, portable high-level modeling technique
Trade-offs between accuracy and simplicity
Advisor: Christos Kozyrakis
Mentor: Partha Ranganathan
Committee: Kunle Olukotun & Dwight Nishimura
Collaborators: Mehul Shah, Dimitris Economou, Justin Meza
Assistance: Jacob Leverich, HP Labs, Charlie Orgish, Teresa Lynn
Defense food! Jayanth and Amin
Architecture grad students
Grant Gavranovic, Kelley Rivoire, friends & family