Limits of Superscalar Architecture

SLIDE 1

CADSL

Limits of Superscalar Architecture

Virendra Singh

Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay

http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in

CS-683: Advanced Computer Architecture

Lecture 21 (09 Oct 2013)

SLIDE 2

CS-683@IITB

VLIW Processors

  • Package multiple operations into one instruction
  • Example VLIW processor:

– One integer instruction (or branch)
– Two independent floating-point operations
– Two independent memory references

There must be enough parallelism in the code to fill the available slots
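A minimal Python sketch of why the slots are hard to fill, assuming the slot mix of the example processor (one integer/branch, two floating-point, two memory) and assuming every operation is independent, which a real compiler would first have to prove. The name `pack_bundles` and the operation labels are hypothetical.

```python
from collections import Counter

# Slot mix of the example VLIW processor: 1 integer/branch, 2 FP, 2 memory.
SLOTS = {"int": 1, "fp": 2, "mem": 2}

def pack_bundles(ops):
    """Greedily pack independent ops (labels 'int', 'fp', 'mem') into bundles.

    Returns (number_of_bundles, empty_slots). Empty slots become no-ops."""
    unknown = set(ops) - set(SLOTS)
    if unknown:
        raise ValueError(f"unknown operation kinds: {sorted(unknown)}")
    remaining = Counter(ops)
    bundles = 0
    while sum(remaining.values()) > 0:
        # Each bundle takes at most `width` ops of each kind.
        for kind, width in SLOTS.items():
            remaining[kind] -= min(width, remaining[kind])
        bundles += 1
    total_slots = bundles * sum(SLOTS.values())
    return bundles, total_slots - len(ops)

# Four integer ops and one FP op: the single integer slot is the bottleneck.
bundles, empty = pack_bundles(["int", "int", "int", "int", "fp"])
print(bundles, empty)
```

With this integer-heavy mix most slots stay empty, which also illustrates the code-size cost of padding bundles with no-ops.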

SLIDE 3

VLIW Processors

  • Disadvantages:

– Statically finding parallelism
– Code size
– No hazard detection hardware
– Binary code compatibility

SLIDE 4

Summary: Multiple Issue

SLIDE 5

Recap: Advanced Superscalars

  • Even simple branch prediction can be quite effective
  • Path-based predictors can achieve >95% accuracy
  • BTB redirects control flow early in pipe; BHT is cheaper per entry but must wait for instruction decode
  • Branch mispredict recovery requires snapshots of pipeline state to reduce penalty
  • Unified physical register file design avoids reading data from multiple locations (ROB + arch regfile)
  • Superscalars can rename multiple dependent instructions in one clock cycle
  • Need speculative store buffer to avoid waiting for stores to commit
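As an illustration of how effective simple prediction can be, here is a minimal Python sketch of a 2-bit saturating counter applied to a single loop branch. The class and function names are hypothetical, and the branch pattern (taken nine times, then not taken) is an assumed example, not data from the lecture.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self, state=2):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def accuracy(outcomes):
    """Fraction of branch outcomes predicted correctly by one counter."""
    counter = TwoBitCounter()
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct / len(outcomes)

# Loop branch: taken 9 times, then falls through once; repeated 10 times.
pattern = ([True] * 9 + [False]) * 10
print(accuracy(pattern))
```

The counter only mispredicts the loop exit, because one not-taken outcome is not enough to flip its prediction for the next loop entry.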

SLIDE 6

Recap: Branch Prediction and Speculative Execution

[Figure: pipeline diagram with blocks PC, Fetch, Decode & Rename, Reorder Buffer, Commit, Branch Prediction, Branch Resolution, Branch Unit, ALU, MEM, Store Buffer, D$, Reg. File, and Execute; signals labeled "Update predictors" and "kill".]

SLIDE 7

Little’s Law

Parallelism = Throughput × Latency

or

N = T × L

[Figure: operations drawn as boxes (each box is one operation), with axes labeled "Latency in Cycles" and "Throughput per Cycle".]

SLIDE 8

Example Pipelined ILP Machine

  • How much instruction-level parallelism (ILP) is required to keep machine pipelines busy?

– Two integer units, single-cycle latency
– Two load/store units, three-cycle latency
– Two floating-point units, four-cycle latency

Max throughput: six instructions per cycle

[Figure: the six pipelines drawn as rows of boxes; each box is one pipeline stage, and the axis is latency in cycles.]

T = 6
L = (2×1 + 2×3 + 2×4) / 6 = 2 2/3
N = T × L = 6 × 2 2/3 = 16
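The arithmetic for this example machine can be checked with a short Python sketch. It assumes, as the slide's formula does, that all six units are kept equally busy, so the average latency is weighted by unit count. The dictionary and function names are hypothetical.

```python
from fractions import Fraction

# (number of units, latency in cycles) for the example machine on this slide.
UNITS = {"integer": (2, 1), "load/store": (2, 3), "floating-point": (2, 4)}

def required_parallelism(units):
    """Return (T, L, N) with N = T * L, per Little's Law."""
    throughput = sum(n for n, _ in units.values())            # T, ops per cycle
    in_flight = sum(n * latency for n, latency in units.values())
    avg_latency = Fraction(in_flight, throughput)              # L, cycles
    return throughput, avg_latency, throughput * avg_latency   # N = T * L

T, L, N = required_parallelism(UNITS)
print(T, L, N)
```

Exact fractions are used so that L stays 8/3 (that is, 2 2/3) rather than a rounded float.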

SLIDE 9

Superscalar Control Logic Scaling

  • Each issued instruction must check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)
  • For in-order machines, L is related to pipeline latencies
  • For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB)
  • As W increases, a larger instruction window is needed to find enough parallelism to keep the machine busy => greater L

=> Out-of-order control logic grows faster than W^2 (~W^3)

[Figure: an issue group of issue width W checked against the previously issued instructions over lifetime L.]
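A small Python sketch of this growth. The function name is hypothetical, and the relation L = 2*W is an assumption chosen only to illustrate a lifetime that grows with issue width; the slide does not give a specific relation.

```python
def dependency_checks(width, lifetime):
    """Checks per cycle: each of W issuing instructions checks W*L in flight."""
    return width * (width * lifetime)

# Assumption for illustration only: lifetime L grows linearly with W (L = 2*W).
for w in (2, 4, 8):
    print(w, dependency_checks(w, lifetime=2 * w))
```

Under that assumption, doubling W multiplies the check count by eight, which is the ~W^3 behaviour noted on the slide; with L held fixed the growth would be W^2.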

SLIDE 10

Superscalar Scenario

  • Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model
  • Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice
  • Conservative in ideas, just faster clock and bigger
  • Processors of last 10 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995

– Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units => performance 8 to 16X

  • Peak v. delivered performance gap increasing

SLIDE 11

Out-of-Order Control Complexity: MIPS R10000

[Figure: MIPS R10000 with the control logic region labeled. Source: SGI/MIPS Technologies Inc., 1995]

SLIDE 12

Sequential ISA Bottleneck

[Figure: sequential source code ("a = foo(b); for (i=0, i<") is fed to a superscalar compiler, which finds independent operations and schedules operations; the resulting sequential machine code is fed to a superscalar processor, which checks instruction dependencies and schedules execution.]

SLIDE 13

For most apps, most execution units lie idle

From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.

For an 8-way superscalar.

SLIDE 14

Limits to ILP

  • Conflicting studies of amount

– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication

  • How much ILP is available using existing mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to keep on processor performance curve?

– Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
– Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
– Motorola AltiVec: 128 bit ints and FPs
– SuperSPARC multimedia ops, etc.

SLIDE 15

Overcoming Limits

  • Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
  • However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future

SLIDE 16

Limits to ILP

Initial HW model here; MIPS compilers. Assumptions for ideal/perfect machine to start:

  • 1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided
  • 2. Branch prediction – perfect; no mispredictions
  • 3. Jump prediction – all jumps perfectly predicted (returns, case statements)

2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available

  • 4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal

1 & 4 eliminates all but RAW

Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle
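Under these assumptions only true (RAW) dependences limit execution, so the achievable ILP of a trace is its instruction count divided by the length of its longest dependence chain. A minimal Python sketch, assuming a hypothetical trace format of (destination, sources) pairs over registers only; it is not the tool used in the study.

```python
def ideal_ilp(instructions):
    """instructions: list of (dest, [sources]) in program order.

    Returns (cycles, ilp) for a machine limited only by RAW dependences,
    with 1-cycle latency and unlimited issue width."""
    if not instructions:
        return 0, 0.0
    ready = {}    # name -> cycle at which its most recent value is available
    cycles = 0
    for dest, sources in instructions:
        # An instruction issues as soon as all of its source values exist.
        issue = max((ready.get(src, 0) for src in sources), default=0)
        done = issue + 1
        # Infinite renaming: a new write never waits for earlier readers/writers.
        ready[dest] = done
        cycles = max(cycles, done)
    return cycles, len(instructions) / cycles

program = [
    ("r1", ["a"]),
    ("r2", ["b"]),
    ("r3", ["r1", "r2"]),
    ("r1", ["c"]),          # reuses r1: WAW/WAR in name only
    ("r4", ["r1"]),
    ("r5", ["r3", "r4"]),
]
print(ideal_ilp(program))
```

The six instructions finish in three cycles because the reuse of `r1` does not serialize anything once renaming removes the name dependence.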

SLIDE 17

Limits to ILP HW Model comparison

                              | Model    | Power 5
Instructions issued per clock | Infinite | 4
Instruction window size       | Infinite | 200
Renaming registers            | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | Perfect  | 2% to 6% misprediction (Tournament Branch Predictor)
Cache                         | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias analysis         | Perfect  | ??

SLIDE 18

Upper Limit to ILP: Ideal Machine

[Figure: instructions per clock on the ideal machine.]

Integer: 18 - 60
FP: 75 - 150

SLIDE 19

Limits to ILP HW Model comparison

                              | New Model                  | Model    | Power 5
Instructions issued per clock | Infinite                   | Infinite | 4
Instruction window size       | Infinite, 2K, 512, 128, 32 | Infinite | 200
Renaming registers            | Infinite                   | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | Perfect                    | Perfect  | 2% to 6% misprediction (Tournament Branch Predictor)
Cache                         | Perfect                    | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | Perfect                    | Perfect  | ??

SLIDE 20

More Realistic HW: Window Impact

Change from infinite window to 2048, 512, 128, 32

FP: 9 - 150
Integer: 8 - 63

Instructions per clock (IPC) by window size:

Window   | gcc | espresso | li | fpppp | doduc | tomcatv
Infinite | 55  | 63       | 18 | 75    | 119   | 150
2048     | 36  | 41       | 15 | 61    | 59    | 60
512      | 10  | 15       | 12 | 49    | 16    | 45
128      | 10  | 13       | 11 | 35    | 15    | 34
32       | 8   | 8        | 9  | 14    | 9     | 14

SLIDE 21

Limits to ILP HW Model comparison

                              | New Model                                                    | Model    | Power 5
Instructions issued per clock | 64                                                           | Infinite | 4
Instruction window size       | 2048                                                         | Infinite | 200
Renaming registers            | Infinite                                                     | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none | Perfect  | 2% to 6% misprediction (Tournament Branch Predictor)
Cache                         | Perfect                                                      | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | Perfect                                                      | Perfect  | ??

SLIDE 22

More Realistic HW: Branch Impact

Change from infinite window to a window of 2048 instructions and a maximum issue of 64 instructions per clock cycle

FP: 15 - 45
Integer: 6 - 12

[Figure: IPC for each branch prediction scheme; surviving legend labels: Profile, BHT (512).]

SLIDE 23

Misprediction Rates

Misprediction rate by predictor:

Program  | Profile-based | 2-bit counter | Tournament
tomcatv  | 1%            | 1%            | 0%
doduc    | 5%            | 16%           | 3%
fpppp    | 14%           | 18%           | 2%
li       | 12%           | 23%           | 2%
espresso | 14%           | 18%           | 4%
gcc      | 12%           | 30%           | 6%

SLIDE 24

Limits to ILP HW Model comparison

                              | New Model                            | Model    | Power 5
Instructions issued per clock | 64                                   | Infinite | 4
Instruction window size       | 2048                                 | Infinite | 200
Renaming registers            | Infinite v. 256, 128, 64, 32, none   | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | 8K 2-bit                             | Perfect  | Tournament Branch Predictor
Cache                         | Perfect                              | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | Perfect                              | Perfect  | Perfect

SLIDE 25

More Realistic HW: Renaming Register Impact (N int + N fp)

Change: 2048 instr window, 64 instr issue, 8K 2 level Prediction

Integer: 5 - 15
FP: 11 - 45

[Figure: IPC for each number of renaming registers.]

SLIDE 26

Limits to ILP HW Model comparison

                              | New Model                             | Model    | Power 5
Instructions issued per clock | 64                                    | Infinite | 4
Instruction window size       | 2048                                  | Infinite | 200
Renaming registers            | 256 Int + 256 FP                      | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | 8K 2-bit                              | Perfect  | Tournament
Cache                         | Perfect                               | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | Perfect v. Stack v. Inspect v. none   | Perfect  | Perfect

SLIDE 27

More Realistic HW: Memory Address Alias Impact

Change: 2048 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers

FP: 4 - 45 (Fortran, no heap)
Integer: 4 - 9

[Figure: IPC for each memory alias analysis scheme.]

SLIDE 28

Limits to ILP HW Model comparison

                              | New Model                     | Model    | Power 5
Instructions issued per clock | 64 (no restrictions)          | Infinite | 4
Instruction window size       | Infinite vs. 256, 128, 64, 32 | Infinite | 200
Renaming registers            | 64 Int + 64 FP                | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | 1K 2-bit                      | Perfect  | Tournament
Cache                         | Perfect                       | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | HW disambiguation             | Perfect  | Perfect

SLIDE 29

Realistic HW: Window Impact

Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

Integer: 6 - 12
FP: 8 - 45

Instruction issues per cycle (IPC) by window size:

Window   | gcc | espresso | li | fpppp | doduc | tomcatv
Infinite | 10  | 15       | 12 | 52    | 17    | 56
256      | 10  | 15       | 12 | 47    | 16    | 45
128      | 10  | 13       | 11 | 35    | 15    | 34
64       | 9   | 10       | 11 | 22    | 12    | 22
32       | 8   | 8        | 9  | 14    | 9     | 14
16       | 6   | 6        | 6  | 8     | 7     | 9
8        | 4   | 4        | 4  | 5     | 4     | 6
4        | 3   | 2        | 3  | 3     | 3     | 3

SLIDE 30

How to Exceed ILP Limits of this study?

  • These are not laws of physics; just practical limits for today, and perhaps overcome via research
  • Compiler and ISA advances could change results
  • WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage
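For reference, a minimal Python sketch of how register renaming removes WAR and WAW hazards, assuming an unlimited pool of physical registers and a hypothetical (destination, sources) instruction format. Memory locations are not renamed here, which is exactly the gap the last bullet points at.

```python
from itertools import count

def rename(instructions):
    """instructions: list of (dest, [sources]) using architectural names.

    Each write gets a fresh physical register, so only RAW dependences remain."""
    fresh = (f"p{i}" for i in count())
    mapping = {}      # architectural register -> current physical register
    renamed = []
    for dest, sources in instructions:
        # Sources read the most recent mapping; unmapped names are live-in values.
        new_sources = [mapping.get(src, src) for src in sources]
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], new_sources))
    return renamed

code = [
    ("r1", ["r2", "r3"]),   # r1 = r2 op r3
    ("r2", ["r4"]),         # WAR with the read of r2 above
    ("r1", ["r2"]),         # WAW with the first write of r1; RAW on the new r2
]
for line in rename(code):
    print(line)
```

After renaming, the second and third instructions write `p1` and `p2`, so neither conflicts with the first instruction; the only remaining ordering is the true dependence of the third instruction on `p1`.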

SLIDE 31

HW v. SW to increase ILP

  • Memory disambiguation: HW best
  • Speculation:

– HW best when dynamic branch prediction better than compile time prediction
– Exceptions easier for HW
– HW doesn’t need bookkeeping code or compensation code
– Very complicated to get right

  • Scheduling: SW can look ahead to schedule better
  • Compiler independence: does not require new compiler, recompilation to run well