
SLIDE 1

CADSL

Limits of Superscalar Architecture

Virendra Singh

Associate Professor
Computer Architecture and Dependable Systems Lab
Department of Electrical Engineering
Indian Institute of Technology Bombay

http://www.ee.iitb.ac.in/~viren/
E-mail: viren@ee.iitb.ac.in

CS-683: Advanced Computer Architecture

Lecture 21 (09 Oct 2013)

SLIDE 2


CS-683@IITB

VLIW Processors

Package multiple operations into one instruction

Example VLIW processor:

– One integer instruction (or branch)
– Two independent floating-point operations
– Two independent memory references

Must be enough parallelism in code to fill the available slots
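To make the slot-filling constraint concrete, here is a minimal Python sketch of a greedy packer for the example VLIW format above (one integer/branch slot, two floating-point slots, two memory slots). The `Op` class, the `pack` function, and the sample operations are hypothetical illustrations, and the sketch assumes every operation has unit latency; it is not how any particular VLIW compiler works.

```python
from dataclasses import dataclass, field

# Slot budget per VLIW instruction, taken from the example processor above.
SLOTS = {"int": 1, "fp": 2, "mem": 2}

@dataclass
class Op:
    name: str
    kind: str                                # "int" (integer or branch), "fp", or "mem"
    deps: set = field(default_factory=set)   # names of ops that must be in an earlier bundle

def pack(ops):
    """Greedily pack ops (given in program order) into bundles of op names."""
    placed = set()          # names of ops already scheduled in earlier bundles
    bundles = []
    pending = list(ops)
    while pending:
        free = dict(SLOTS)  # slots still empty in the bundle being built
        bundle = []
        for op in pending:
            # An op is ready only if all producers sit in earlier bundles.
            if op.deps <= placed and free[op.kind] > 0:
                free[op.kind] -= 1
                bundle.append(op)
        if not bundle:
            raise ValueError("dependence on an unknown or cyclic operation")
        for op in bundle:
            pending.remove(op)
        placed |= {op.name for op in bundle}
        bundles.append([op.name for op in bundle])
    return bundles

# Hypothetical code fragment: two loads feed an FP add, whose result is stored.
example = [
    Op("ld1", "mem"),
    Op("ld2", "mem"),
    Op("fadd", "fp", {"ld1", "ld2"}),
    Op("st", "mem", {"fadd"}),
    Op("addi", "int"),
]
bundles = pack(example)
utilization = len(example) / (len(bundles) * sum(SLOTS.values()))
print(bundles, utilization)
```

In this example the dependence chain forces three bundles, so only 5 of the 15 available slots carry useful work; the rest would be encoded as no-ops, which is one source of the code-size cost.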


SLIDE 3


VLIW Processors

Disadvantages:

– Statically finding parallelism
– Code size
– No hazard detection hardware
– Binary code compatibility


SLIDE 4


Summary: Multiple Issue


SLIDE 5


Recap: Advanced Superscalars

Even simple branch prediction can be quite effective
Path-based predictors can achieve >95% accuracy
BTB redirects control flow early in pipe; BHT is cheaper per entry but must wait for instruction decode
Branch mispredict recovery requires snapshots of pipeline state to reduce penalty
Unified physical register file design avoids reading data from multiple locations (ROB + arch regfile)
Superscalars can rename multiple dependent instructions in one clock cycle
Need speculative store buffer to avoid waiting for stores to commit


SLIDE 6


Recap: Branch Prediction and Speculative Execution

[Figure: speculative out-of-order pipeline. Labeled blocks: PC, Fetch, Decode & Rename, Reorder Buffer, Commit, Branch Prediction, Branch Resolution, Execute, Branch Unit, ALU, MEM, Store Buffer, D$, Reg. File. Labeled signals: kill (four instances), update predictors.]


SLIDE 7


Little’s Law

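Parallelism = Throughput * Latency

$N = T \times L$

where $N$ is the number of operations in flight, $T$ is the throughput per cycle, and $L$ is the latency in cycles.

[Figure: grid of operations in flight. Axes: Latency in Cycles, Throughput per Cycle; each box is One Operation.]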

SLIDE 8


Example Pipelined ILP Machine

How much instruction-level parallelism (ILP) required to keep machine pipelines busy?

– Two Integer Units, Single Cycle Latency
– Two Load/Store Units, Three Cycle Latency
– Two Floating-Point Units, Four Cycle Latency

Max Throughput, Six Instructions per Cycle:

$T = 6$

$L = \dfrac{(2 \times 1) + (2 \times 3) + (2 \times 4)}{6} = 2\tfrac{2}{3}$

$N = 6 \times 2\tfrac{2}{3} = 16$

[Figure: the six pipelines drawn against Latency in Cycles; each box is One Pipeline Stage.]
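The arithmetic on this slide can be checked with a short Python sketch. The `units` dictionary and the function name `required_parallelism` are illustrative names, not part of the lecture; the unit counts and latencies are the ones listed above.

```python
from fractions import Fraction

# (number of units, latency in cycles) for the example machine above.
units = {
    "integer": (2, 1),
    "load/store": (2, 3),
    "floating-point": (2, 4),
}

def required_parallelism(units):
    """Apply Little's Law, N = T * L, to a set of fully pipelined units."""
    # Peak throughput: one instruction per unit per cycle.
    T = sum(count for count, _ in units.values())
    # Average latency, weighted by how many units have each latency.
    L = Fraction(sum(count * latency for count, latency in units.values()), T)
    return T, L, T * L

T, L, N = required_parallelism(units)
print(T, L, N)  # 6 8/3 16
```

So 16 independent instructions must be in flight at all times to keep every stage of every pipeline occupied.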

SLIDE 9


Superscalar Control Logic Scaling

Each issued instruction must check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)
For in-order machines, L is related to pipeline latencies
For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB)
As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L
=> Out-of-order control logic grows faster than W^2 (~W^3)

[Figure: an Issue Group of Issue Width W checked against Previously Issued Instructions over Lifetime L.]
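The scaling argument can be sketched numerically in Python. The function names and the proportionality constant `k` are assumptions made for illustration: the slide only says that L grows with W in an out-of-order machine, and the sketch models that growth as linear.

```python
def dependence_checks(W, L):
    """Comparisons per cycle: each of W issuing instructions checks W*L others."""
    return W * (W * L)

def out_of_order_checks(W, k=4):
    """Same count when lifetime L grows with issue width (assumed L = k*W)."""
    return dependence_checks(W, k * W)

for W in (2, 4, 8):
    # Fixed L = 3 stands in for an in-order machine with fixed pipeline latencies.
    print(W, dependence_checks(W, 3), out_of_order_checks(W))
```

Under these assumptions, doubling W multiplies the in-order count by 4 (quadratic growth) and the out-of-order count by 8 (cubic growth).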


SLIDE 10


Superscalar Scenario

Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model
Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice
Conservative in ideas, just faster clock and bigger
Processors of last 10 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995

– Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units => performance 8 to 16X

Peak v. delivered performance gap increasing


SLIDE 11


Out-of-Order Control Complexity: MIPS R10000

Control Logic

[ SGI/MIPS Technologies Inc., 1995 ]


SLIDE 12


Sequential ISA Bottleneck

[Figure: flow from sequential source code (sample text: a = foo(b); for (i=0, i< ...) through a superscalar compiler (find independent operations, schedule operations) to sequential machine code, then to a superscalar processor (check instruction dependencies, schedule execution).]


SLIDE 13


For most apps, most execution units lie idle

From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.

For an 8-way superscalar.

SLIDE 14


Limits to ILP

Conflicting studies of amount

– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication

How much ILP is available using existing mechanisms with increasing HW budgets?

Do we need to invent new HW/SW mechanisms to keep on processor performance curve?

– Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
– Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
– Motorola AltiVec: 128 bit ints and FPs
– Supersparc Multimedia ops, etc.

SLIDE 15


Overcoming Limits

Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future

SLIDE 16


Limits to ILP

Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:

1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted (returns, case statements)
   2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal
   1 & 4 eliminates all but RAW

Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle
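Under these assumptions, the only thing that delays an instruction is a true (RAW) dependence on an earlier result, so the achievable IPC is the instruction count divided by the length of the longest dependence chain. The following Python sketch illustrates that idea on a register-only trace; the trace format, the function name `ideal_ipc`, and the sample trace are hypothetical, and memory dependences are left out for brevity.

```python
def ideal_ipc(trace):
    """IPC of the ideal machine on a trace of (dest_reg, [src_regs]) tuples.

    Assumes unlimited issue width, an unbounded window, perfect prediction,
    and 1-cycle latency. Renaming removes WAW/WAR hazards, so an instruction
    waits only for the most recent writers of its source registers (RAW).
    """
    if not trace:
        return 0.0
    ready = {}        # register -> cycle at which its newest value is available
    last_cycle = 0
    for dest, srcs in trace:
        issue = max((ready.get(s, 0) for s in srcs), default=0)
        done = issue + 1                  # every instruction takes one cycle
        if dest is not None:
            ready[dest] = done            # later readers see this (renamed) value
        last_cycle = max(last_cycle, done)
    return len(trace) / last_cycle

# Hypothetical trace: r1 is written twice, but renaming makes the second
# write independent of the first, so it can execute in the first cycle too.
trace = [
    ("r1", []),
    ("r2", []),
    ("r3", ["r1", "r2"]),
    ("r1", []),
    ("r4", ["r1"]),
    ("r5", ["r3", "r4"]),
]
print(ideal_ipc(trace))  # 2.0
```

The six instructions finish in three cycles because the longest RAW chain has three links, giving an IPC of 2.0.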

SLIDE 17


Limits to ILP HW Model comparison

Parameter | Model | Power 5
Instructions Issued per clock | Infinite | 4
Instruction Window Size | Infinite | 200
Renaming Registers | Infinite | 48 integer + 40 Fl. Pt.
Branch Prediction | Perfect | 2% to 6% misprediction (Tournament Branch Predictor)
Cache | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias Analysis | Perfect | ??

SLIDE 18


Upper Limit to ILP: Ideal Machine

[Figure: Instructions Per Clock on the ideal machine, per benchmark.]

Integer: 18 - 60
FP: 75 - 150

SLIDE 19


Limits to ILP HW Model comparison

Parameter | New Model | Model | Power 5
Instructions Issued per clock | Infinite | Infinite | 4
Instruction Window Size | Infinite, 2K, 512, 128, 32 | Infinite | 200
Renaming Registers | Infinite | Infinite | 48 integer + 40 Fl. Pt.
Branch Prediction | Perfect | Perfect | 2% to 6% misprediction (Tournament Branch Predictor)
Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias | Perfect | Perfect | ??

SLIDE 20


More Realistic HW: Window Impact

Change from Infinite window to 2048, 512, 128, 32

FP: 9 - 150
Integer: 8 - 63

Instructions Per Clock (IPC) by window size:

Program | Infinite | 2048 | 512 | 128 | 32
gcc | 55 | 36 | 10 | 10 | 8
espresso | 63 | 41 | 15 | 13 | 8
li | 18 | 15 | 12 | 11 | 9
fpppp | 75 | 61 | 49 | 35 | 14
doduc | 119 | 59 | 16 | 15 | 9
tomcatv | 150 | 60 | 45 | 34 | 14
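One way to read the chart data above is as the fraction of ideal-machine IPC that survives at each window size. The Python sketch below does that calculation; the dictionary layout and the helper name `retained` are illustrative, and the numbers are read off the chart on this slide in series order (Infinite, 2048, 512, 128, 32).

```python
WINDOWS = ["Infinite", 2048, 512, 128, 32]

# IPC per program at each window size, in the order given by WINDOWS.
IPC = {
    "gcc":      [55, 36, 10, 10, 8],
    "espresso": [63, 41, 15, 13, 8],
    "li":       [18, 15, 12, 11, 9],
    "fpppp":    [75, 61, 49, 35, 14],
    "doduc":    [119, 59, 16, 15, 9],
    "tomcatv":  [150, 60, 45, 34, 14],
}

def retained(program, window):
    """Fraction of the infinite-window IPC still achieved at `window`."""
    values = IPC[program]
    return values[WINDOWS.index(window)] / values[0]

for program in IPC:
    row = ", ".join(f"{w}: {retained(program, w):.0%}" for w in WINDOWS[1:])
    print(f"{program:9s} {row}")
```

The FP programs lose the most in absolute terms because their parallelism is spread across loop iterations that only a very large window can see at once.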

SLIDE 21


Limits to ILP HW Model comparison

Parameter | New Model | Model | Power 5
Instructions Issued per clock | 64 | Infinite | 4
Instruction Window Size | 2048 | Infinite | 200
Renaming Registers | Infinite | Infinite | 48 integer + 40 Fl. Pt.
Branch Prediction | Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none | Perfect | 2% to 6% misprediction (Tournament Branch Predictor)
Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias | Perfect | Perfect | ??

SLIDE 22


More Realistic HW: Branch Impact

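Change from an infinite window to a 2048-entry window and a maximum issue of 64 instructions per clock cycle

FP: 15 - 45
Integer: 6 - 12

[Figure: IPC per benchmark for each branch prediction scheme. Legend entries that survived extraction: Profile, BHT (512).]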

SLIDE 23


Misprediction Rates

Misprediction Rate by predictor:

Program | Profile-based | 2-bit counter | Tournament
tomcatv | 1% | 1% | 0%
doduc | 5% | 16% | 3%
fpppp | 14% | 18% | 2%
li | 12% | 23% | 2%
espresso | 14% | 18% | 4%
gcc | 12% | 30% | 6%
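For readers who want to see where a 2-bit counter's mispredictions come from, here is a minimal Python sketch of a table of 2-bit saturating counters. The class name, the table size default, the `pc % entries` indexing, and the synthetic loop branch are all assumptions for illustration; the sketch does not reproduce the configurations or benchmarks behind the rates above.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (0-1 predict not taken, 2-3 taken)."""

    def __init__(self, entries=512):
        self.entries = entries
        self.counters = [1] * entries    # start every entry at weakly not-taken

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def count_mispredictions(predictor, branches):
    """branches: iterable of (pc, taken). Returns (mispredictions, total)."""
    wrong = total = 0
    for pc, taken in branches:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
        total += 1
    return wrong, total

# Synthetic loop-closing branch: taken 9 times, then falls through; run 10 times.
loop_branch = [(0x40, t) for _ in range(10) for t in [True] * 9 + [False]]
wrong, total = count_mispredictions(TwoBitPredictor(), loop_branch)
print(wrong, total)  # 11 100
```

After warm-up, the counter mispredicts only the loop exit, which is why simple loops are handled well while data-dependent branches (common in integer codes such as gcc) are not.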

SLIDE 24


Limits to ILP HW Model comparison

Parameter | New Model | Model | Power 5
Instructions Issued per clock | 64 | Infinite | 4
Instruction Window Size | 2048 | Infinite | 200
Renaming Registers | Infinite v. 256, 128, 64, 32, none | Infinite | 48 integer + 40 Fl. Pt.
Branch Prediction | 8K 2-bit | Perfect | Tournament Branch Predictor
Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias | Perfect | Perfect | Perfect

SLIDE 25


More Realistic HW: Renaming Register Impact (N int + N fp)

Change 2048 instr window, 64 instr issue, 8K 2 level Prediction

Integer: 5 - 15
FP: 11 - 45

[Figure: IPC per benchmark for each number of renaming registers.]

SLIDE 26


Limits to ILP HW Model comparison

Parameter | New Model | Model | Power 5
Instructions Issued per clock | 64 | Infinite | 4
Instruction Window Size | 2048 | Infinite | 200
Renaming Registers | 256 Int + 256 FP | Infinite | 48 integer + 40 Fl. Pt.
Branch Prediction | 8K 2-bit | Perfect | Tournament
Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias | Perfect v. Stack v. Inspect v. none | Perfect | Perfect

SLIDE 27


More Realistic HW: Memory Address Alias Impact

Change 2048 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers

FP: 4 - 45 (Fortran, no heap)
Integer: 4 - 9

[Figure: IPC per benchmark for each memory alias analysis scheme.]

SLIDE 28


Limits to ILP HW Model comparison

Parameter | New Model | Model | Power 5
Instructions Issued per clock | 64 (no restrictions) | Infinite | 4
Instruction Window Size | Infinite vs. 256, 128, 64, 32 | Infinite | 200
Renaming Registers | 64 Int + 64 FP | Infinite | 48 integer + 40 Fl. Pt.
Branch Prediction | 1K 2-bit | Perfect | Tournament
Cache | Perfect | Perfect | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias | HW disambiguation | Perfect | Perfect

SLIDE 29


Realistic HW: Window Impact

Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

Integer: 6 - 12
FP: 8 - 45

Instruction issues per cycle (IPC) by window size:

Program | Infinite | 256 | 128 | 64 | 32 | 16 | 8 | 4
gcc | 10 | 10 | 10 | 9 | 8 | 6 | 4 | 3
espresso | 15 | 15 | 13 | 10 | 8 | 6 | 4 | 2
li | 12 | 12 | 11 | 11 | 9 | 6 | 4 | 3
fpppp | 52 | 47 | 35 | 22 | 14 | 8 | 5 | 3
doduc | 17 | 16 | 15 | 12 | 9 | 7 | 4 | 3
tomcatv | 56 | 45 | 34 | 22 | 14 | 9 | 6 | 3

SLIDE 30


How to Exceed ILP Limits of this study?

These are not laws of physics; just practical limits for today, and perhaps overcome via research
Compiler and ISA advances could change results
WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage

SLIDE 31


HW v. SW to increase ILP

Memory disambiguation: HW best
Speculation:

– HW best when dynamic branch prediction better than compile time prediction
– Exceptions easier for HW
– HW doesn’t need bookkeeping code or compensation code
– Very complicated to get right

Scheduling: SW can look ahead to schedule better
Compiler independence: does not require new compiler, recompilation to run well