Performance, Power
CS301, Prof Szajda
Performance Metrics
(How do we compare two machines?)
What to Measure?
Which airplane has the best performance?
Performance
- One size does not fit all
- Depends on application domain
  - Scientific computing
  - Graphics
  - Databases
  - General-purpose desktop
  - Beware of designing to benchmark!
- Depends on technology characteristics
  - DRAM speed and capacity, chip size, etc.
Which Metric Do We Use?
- Response or execution time
  - Difference between start and end time
  - Individual user cares most about this
- Throughput
  - Total amount of work done in given time
  - Frequently used for servers and clusters
- How are these affected by
  - Replacing processor with faster version?
  - Adding more processors?
Execution Time
- Shorter execution time is better
- Allows comparison between 2 machines
Relative Performance
- “X is n times faster than Y”
- Example:
  - Machine A takes 10s to run program
  - Machine B takes 15s to run same program
  - What is the performance ratio?
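The ratio asked for above can be worked out in a few lines. This is a minimal sketch that assumes performance is defined as the reciprocal of execution time, so "X is n times faster than Y" means n = time(Y) / time(X); the variable names are illustrative only.

```python
# Execution times from the example, in seconds
time_a_s = 10.0
time_b_s = 15.0

# Assuming performance = 1 / execution time:
# perf_A / perf_B = time_B / time_A
n = time_b_s / time_a_s

print(f"Machine A is {n:.1f} times faster than Machine B")
```

Under that definition, the machine with the shorter execution time always ends up in the denominator.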
Different Time Values
- Execution time
  - Wall-clock, response, or elapsed time
  - Includes everything (processing, I/O, OS overhead, etc.)!
  - Determines system performance
- CPU time
  - Time spent executing code for this task only
  - Does not include I/O or time-sharing
  - Comprises user CPU time and system CPU time
  - Different programs are affected differently by CPU and system performance
man time
- 90.7u 12.9s 2:39 65%
  - User: 90.7 sec
  - System: 12.9 sec
  - Elapsed time: 2 min 39 sec
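The trailing 65% in the sample output is not spelled out on the slide. A plausible reading, sketched here as an assumption, is that it is the share of elapsed time the task spent on the CPU, that is (user + system) / elapsed.

```python
# Values from the sample output: 90.7u 12.9s 2:39 65%
user_s = 90.7
system_s = 12.9
elapsed_s = 2 * 60 + 39            # 2:39 -> 159 seconds

cpu_s = user_s + system_s          # CPU time = user CPU time + system CPU time
cpu_share = cpu_s / elapsed_s      # fraction of elapsed time spent on the CPU

print(f"CPU time: {cpu_s:.1f} s")
print(f"CPU share of elapsed time: {cpu_share:.0%}")
```

The remaining elapsed time would then be I/O waits and time given to other tasks.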
Clock Cycles
- Instead of expressing time in seconds, use clock cycles
- Clock
  - Determines when events take place
  - Runs at constant rate (ex. 1 GHz)
  - Easy to convert between clock rate and seconds
  - Clock rate = 1 / Clock cycle time
  - 500 MHz = 1 / (2 ns)
  - 1 ns = $10^{-9}$ s
Chapter 1 — Computer Abstractions and Technology —
CPU Clocking
Operation of digital hardware governed by a constant-rate clock
[Figure: clock timing diagram showing the clock period, with data transfer and computation followed by a state update in each cycle]
Clock period: duration of a clock cycle
  e.g., 250 ps = 0.25 ns = $250 \times 10^{-12}$ s
Clock frequency (rate): cycles per second
  e.g., 4.0 GHz = 4000 MHz = $4.0 \times 10^{9}$ Hz
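Since clock rate and clock period are reciprocals, the conversions quoted on these slides can be checked with a short sketch. The helper names below are illustrative and not part of the lecture.

```python
def clock_period_s(rate_hz: float) -> float:
    """Clock period in seconds for a clock rate in hertz."""
    return 1.0 / rate_hz


def clock_rate_hz(period_s: float) -> float:
    """Clock rate in hertz for a clock period in seconds."""
    return 1.0 / period_s


# 500 MHz should correspond to a 2 ns period
print(f"{clock_period_s(500e6) * 1e9:.2f} ns")

# A 250 ps period should correspond to a 4.0 GHz clock
print(f"{clock_rate_hz(250e-12) / 1e9:.2f} GHz")
```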
An Aside
CPU Time
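$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \dfrac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$
Performance improved by
  Reducing number of clock cycles
  Increasing clock rate
Hardware designer must often trade off clock rate against cycle count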
CPU Time Example
Computer A: 2 GHz clock, 10 s CPU time
Designing Computer B
  Aim for 6 s CPU time
  Can do faster clock, but causes 1.2 × clock cycles
How fast must Computer B clock be?
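One way to work this example is to recover Computer A's cycle count from its CPU time and clock rate, scale it by 1.2, and divide by the target time. The sketch below assumes CPU time = clock cycles / clock rate; variable names are illustrative.

```python
rate_a_hz = 2e9            # Computer A: 2 GHz clock
time_a_s = 10.0            # Computer A: 10 s CPU time
target_time_b_s = 6.0      # Computer B: aim for 6 s CPU time
cycle_factor = 1.2         # Computer B needs 1.2x as many clock cycles

cycles_a = time_a_s * rate_a_hz          # clock cycles = CPU time x clock rate
cycles_b = cycle_factor * cycles_a       # cycles the same program takes on B
rate_b_hz = cycles_b / target_time_b_s   # clock rate = clock cycles / CPU time

print(f"Computer B needs a {rate_b_hz / 1e9:.1f} GHz clock")
```

So B has to double A's clock rate to cut the time from 10 s to 6 s, because part of the gain is spent on the extra cycles.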
Instruction Count and CPI
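$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction (CPI)}$
$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}$
Instruction Count for a program
  Determined by program, ISA and compiler
Average cycles per instruction
  Determined by CPU hardware
  If different instructions have different CPI
    Average CPI affected by instruction mix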
CPI Example
Computer A: Cycle Time = 250 ps, CPI = 2.0
Computer B: Cycle Time = 500 ps, CPI = 1.2
Same ISA
Which is faster, and by how much?
A is faster… …by this much
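Because both computers run the same ISA, the instruction count is the same on each and cancels out of the comparison. The sketch below therefore compares time per instruction (CPI × cycle time) only; the helper name is illustrative.

```python
def time_per_instruction_ps(cpi: float, cycle_time_ps: float) -> float:
    """Average time per instruction in picoseconds."""
    return cpi * cycle_time_ps


a_ps = time_per_instruction_ps(2.0, 250.0)   # Computer A
b_ps = time_per_instruction_ps(1.2, 500.0)   # Computer B

# Same instruction count on both machines, so the CPU time ratio
# equals the ratio of time per instruction.
ratio = b_ps / a_ps

print(f"A: {a_ps:.0f} ps per instruction, B: {b_ps:.0f} ps per instruction")
print(f"A is faster by a factor of {ratio:.1f}")
```

B's lower CPI does not make up for its cycle time being twice as long.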
Application Characteristics
- Determine the mix of different instruction types
  - Integer arithmetic
  - Logical operations
  - Floating point arithmetic
  - Loads and stores
- Different applications have different CPI because of different instruction mixes
CPI in More Detail
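If different instruction classes take different numbers of cycles
  $\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$
Weighted average CPI
  $\text{CPI} = \dfrac{\text{Clock Cycles}}{\text{IC}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \dfrac{\text{IC}_i}{\text{IC}} \right)$
  where $\text{IC}_i / \text{IC}$ is the relative frequency of instruction class $i$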
CPI Example
Alternative compiled code sequences using instructions in classes A, B, C

Class               A   B   C
CPI for class       1   2   3
IC in sequence 1    2   1   2
IC in sequence 2    4   1   1

Sequence 1: IC = 5
  Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6
  Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  Avg. CPI = 9/6 = 1.5
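The arithmetic for the two sequences can be reproduced with a short sketch that uses only the numbers from the table. The dictionary layout and the `summarize` helper are illustrative choices, not part of the lecture.

```python
# CPI for each instruction class, from the table
cpi_per_class = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for each compiled sequence
sequences = {
    "Sequence 1": {"A": 2, "B": 1, "C": 2},
    "Sequence 2": {"A": 4, "B": 1, "C": 1},
}


def summarize(counts):
    """Return (instruction count, clock cycles, average CPI)."""
    ic = sum(counts.values())
    cycles = sum(cpi_per_class[cls] * n for cls, n in counts.items())
    return ic, cycles, cycles / ic


results = {name: summarize(counts) for name, counts in sequences.items()}
for name, (ic, cycles, avg_cpi) in results.items():
    print(f"{name}: IC = {ic}, clock cycles = {cycles}, avg CPI = {avg_cpi:.1f}")
```

Sequence 2 executes more instructions yet needs fewer cycles, so instruction count alone is not a reliable performance measure.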
Performance Summary
Performance depends on (IC = instruction count, CPI = cycles per instruction, Tc = clock cycle time)
  Algorithm: affects IC, possibly CPI
  Programming language: affects IC, CPI
  Compiler: affects IC, CPI
  Instruction set architecture: affects IC, CPI, Tc
The BIG Picture
Amdahl’s Law
- How much speedup do you get from an enhancement?
- Based on
  - Fraction of time enhancement used
  - Improvement in enhanced mode
$\text{Speedup} = \dfrac{\text{Execution time w/o enhancement}}{\text{Execution time w/ enhancement}}$
$\text{Exec}_{new} = \text{Exec}_{old} \times \left( (1 - \text{fraction}_{enh}) + \dfrac{\text{fraction}_{enh}}{\text{Speedup}_{enh}} \right)$
Pitfall: Amdahl’s Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance
§1.10 Fallacies and Pitfalls
Example: multiply accounts for 80s/100s
  How much improvement in multiply performance to get 5× overall?
  A 5× speedup means a total of 20s, so we would need $20 = \dfrac{80}{n} + 20$
  Can’t be done!
Corollary: make the common case fast
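The multiply example can be explored numerically. In the sketch below, `n` stands for the hypothetical speedup factor applied to multiply alone, and the 20 s of non-multiply time is assumed to be unaffected.

```python
total_s = 100.0
multiply_s = 80.0
other_s = total_s - multiply_s     # 20 s not affected by the improvement
target_s = total_s / 5             # a 5x overall speedup means 20 s total


def exec_time_after(n: float) -> float:
    """Execution time when multiply alone runs n times faster."""
    return multiply_s / n + other_s


for n in (2, 10, 100, 1_000_000):
    t = exec_time_after(n)
    print(f"multiply {n}x faster -> {t:.4f} s, overall speedup {total_s / t:.3f}x")

# The unaffected 20 s already equals the 20 s target, so no finite n
# can bring the total down to it.
reachable = other_s < target_s
print(f"5x overall reachable: {reachable}")
```

Overall speedup approaches 5× as `n` grows but never reaches it, which is why the unimproved fraction bounds the achievable gain.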
Review Question
- Your machine has a clock rate of 2.4 GHz. How long is the clock cycle?
Review Questions
- Suppose you are given the following:
  - Machine A
    - 1 GHz
    - Average CPI = 1.6
    - Instructions = 1.7 Billion
  - Machine B
    - 3.3 GHz
    - Average CPI = 6.1
    - Instructions = 2 Billion
- Which machine is faster? By how much?
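One way to check an answer to this question is to apply CPU time = instruction count × CPI / clock rate to each machine. The `cpu_time_s` helper below is an illustrative name; the inputs are the values given above.

```python
def cpu_time_s(instructions: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instructions * cpi / clock_rate_hz


time_a = cpu_time_s(1.7e9, 1.6, 1.0e9)   # Machine A
time_b = cpu_time_s(2.0e9, 6.1, 3.3e9)   # Machine B

if time_a < time_b:
    faster, ratio = "A", time_b / time_a
else:
    faster, ratio = "B", time_a / time_b

print(f"Machine A: {time_a:.2f} s, Machine B: {time_b:.2f} s")
print(f"Machine {faster} is faster by a factor of {ratio:.2f}")
```

The machine with the higher clock rate loses here because its CPI and instruction count are both larger.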
Review Questions
- What is the average CPI for a machine with the following CPIs on an application with the following instruction frequency?

Type         Frequency   CPI
Arithmetic   0.45        1
Memory       0.3         8
Control      0.2         3
Mult/Div     0.05        5
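A frequency-weighted average is one way to approach this question: multiply each type's CPI by its relative frequency and sum the results. The sketch below uses the table values; the dictionary layout is an illustrative choice.

```python
# (relative frequency, CPI) for each instruction type, from the table
mix = {
    "Arithmetic": (0.45, 1),
    "Memory": (0.30, 8),
    "Control": (0.20, 3),
    "Mult/Div": (0.05, 5),
}

# The frequencies should cover the whole instruction stream
total_freq = sum(freq for freq, _ in mix.values())

# Weighted average CPI = sum over types of frequency x CPI
avg_cpi = sum(freq * cpi for freq, cpi in mix.values())

print(f"Total frequency: {total_freq:.2f}")
print(f"Average CPI: {avg_cpi:.2f}")
```

Memory instructions contribute the largest share of the average even though they are only 30% of the mix.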
Review Questions
- What factors must be included when comparing the relative performance of two machines?
Amdahl’s Law
- Suppose you have an enhancement that makes a functional unit 10x faster.
- Speedup if used 5% of the time?
- Speedup if used 40% of the time?
$\text{Exec}_{new} = \text{Exec}_{old} \times \left( (1 - \text{fraction}_{enh}) + \dfrac{\text{fraction}_{enh}}{\text{Speedup}_{enh}} \right)$
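Both cases can be evaluated with a small function, assuming the usual form of Amdahl's Law in which the enhanced fraction of execution time is divided by the enhancement's speedup. The function name `amdahl_speedup` is illustrative.

```python
def amdahl_speedup(fraction_enh: float, speedup_enh: float) -> float:
    """Overall speedup = Exec_old / Exec_new under Amdahl's Law."""
    return 1.0 / ((1.0 - fraction_enh) + fraction_enh / speedup_enh)


for fraction in (0.05, 0.40):
    overall = amdahl_speedup(fraction, 10.0)
    print(f"used {fraction:.0%} of the time -> overall speedup {overall:.3f}x")
```

A 10x faster unit used 5% of the time barely changes overall performance, which is the point of "make the common case fast".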
Review Questions
- What is the equation for execution time?
- What does Amdahl’s Law say?
Benchmarks
- Programs specifically used to measure performance
- Hope is that it is representative of how computer will be used
- Examples
  - SPEC Integer and Floating Point
  - MediaBench
  - MineBench
  - TPC
SPEC CPU Benchmark
Programs used to measure performance
  Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
  Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
  Elapsed time to execute a selection of programs
    Negligible I/O, so focuses on CPU performance
  Normalize relative to reference machine
  Summarize as geometric mean of performance ratios
    CINT2006 (integer) and CFP2006 (floating-point)
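The normalize-then-summarize step can be sketched as follows. The timings are hypothetical values made up for illustration, not real SPEC results, and the sketch assumes each performance ratio is reference time divided by measured time.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (reference time, measured time) pairs in seconds.
# These are illustrative numbers only, not actual benchmark data.
times = [(1000.0, 500.0), (2000.0, 250.0), (1500.0, 750.0)]

# Normalize relative to the reference machine: higher ratio = faster
ratios = [reference / measured for reference, measured in times]

score = geometric_mean(ratios)
print(f"Performance ratios: {ratios}")
print(f"Geometric mean: {score:.3f}")
```

A property of the geometric mean is that the ratio of two machines' scores does not depend on which machine is chosen as the reference.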
CINT2006 for Intel Core i7 920
Recent Concern: Power
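In CMOS IC technology
$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$
§1.7 The Power Wall
Annotations on the slide: ×1000, ×40, 5V → 1V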
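The annotations can be related through the standard CMOS dynamic power relation, Power ∝ Capacitive load × Voltage² × Frequency. The sketch below assumes that relation, reads ×1000 as the frequency increase, and makes the simplifying assumption (not stated on the slide) that capacitive load is unchanged.

```python
def relative_power(cap_ratio: float, voltage_ratio: float, freq_ratio: float) -> float:
    """New power / old power, assuming power ~ C x V^2 x f."""
    return cap_ratio * voltage_ratio ** 2 * freq_ratio


# Supply voltage 5V -> 1V and clock frequency x1000, from the annotations.
# ASSUMPTION: capacitive load unchanged (ratio 1.0), chosen for illustration.
ratio = relative_power(cap_ratio=1.0, voltage_ratio=1.0 / 5.0, freq_ratio=1000.0)

print(f"Relative power: x{ratio:.0f}")
```

Under those assumptions the result is consistent with the ×40 annotation. The squared voltage term absorbs most of the frequency increase, so once voltage can no longer be lowered, higher clock rates translate directly into more power.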
Tricks to Increase Power
- Attach large cooling devices
- Turn off parts of chips not used in given clock cycle
  - Can increase power to 300 watts...
  - ...But these and other ways are all prohibitively expensive for desktop computers. So...
More Recent Approaches: Chip Multiprocessors
- Reasons for change
  - Limited opportunities to improve single thread performance
  - Power
  - On-chip communication latencies
Uniprocessor Performance
§1.8 The Sea Change: The Switch to Multiprocessors
Constrained by power, instruction-level parallelism, memory latency
Multiprocessors
Multicore microprocessors
  More than one processor per chip
Requires explicitly parallel programming
  Compare with instruction level parallelism
    Hardware executes multiple instructions at once
    Hidden from the programmer
  Hard to do
    Programming for performance
    Load balancing
    Optimizing communication and synchronization
Concluding Remarks
Cost/performance is improving
Due to underlying technology development
Hierarchical layers of abstraction
In both hardware and software
Instruction set architecture
The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
Use parallelism to improve performance
§1.9 Concluding Remarks