last time
HCL details/built in components HCL debug/interactive options walkthrough of SEQ stages/needed MUXes
2
critical path
every path from state output to state input needs enough time
output — may change on rising edge of clock
input — must be stable sufficiently before rising edge of clock
critical path: slowest of all these paths — determines cycle time
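The cycle-time rule can be sketched in a few lines of Python; the path names and delays below are illustrative, taken from the example timings on the SEQ paths slide:

```python
# Illustrative delays (ps) for paths from state output to state input.
paths = {"path 1": 25, "path 2": 50, "path 3": 400, "path 4": 900}

# The clock period must be at least as long as the slowest path.
cycle_time_ps = max(paths.values())
print(cycle_time_ps)  # 900: the critical path sets the clock period
```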
3
SEQ paths
[SEQ datapath diagram: PC feeds Instr. Mem.; the instruction's rA/rB fields and 0xF/%rsp constants select srcA/srcB in the register file; R[srcA], R[srcB], valC, and 8 feed the ALU inputs aluA/aluB (add/sub/xor/and, function of instr.) producing valE; ZF/SF feed Stat; Data Mem. takes Data in and Addr in, produces Data out (write? is a function of the opcode); results return as next R[dstE] and next R[dstM]; PC + instr. length computes the next PC]
path 1: 25 picoseconds
path 2: 50 picoseconds
path 3: 400 picoseconds
path 4: 900 picoseconds
…and many, many more paths
4
sequential addq paths
[addq-only datapath: PC feeds Instr. Mem.; split extracts the register numbers; the register file supplies R[srcA] and R[srcB] to an ADD whose result becomes next R[dstE]; a second ADD computes PC + 2 for the next PC]
path 1: 25 picoseconds
path 2: 375 picoseconds
path 3: 500 picoseconds
path 4: 500 picoseconds
overall cycle time: 500 picoseconds (longest path)
5
Human pipeline: laundry
[Timelines, 11:00 to 14:00, stages Washer, Dryer, Folding Table: one version runs whites, colors, sheets one load at a time; the pipelined version overlaps them, each load advancing one stage per time slot]
6
Waste (1)
[Timeline, 11:00 to 14:00: whites, colors, sheets run one load at a time through Washer, Dryer, Folding Table; the gaps where stages sit idle are marked "wasted time!"]
7
Waste (2)
[Timeline, 11:00 to 14:00: whites, colors, sheets through Washer, Dryer, Folding Table]
8
Latency — Time for One
[Timeline, 11:00 to 14:00: pipelined latency for a single load is 2.1 h, versus a normal latency of 1.8 h when the load runs alone]
9
Throughput — Rate of Many
[Timeline, 11:00 to 14:00: time between starts = time between finishes = 0.83 h, so throughput is 1 load / 0.83 h ≈ 1.2 loads/h]
10
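The latency/throughput distinction can be checked numerically with the slides' laundry figures:

```python
# Laundry pipeline numbers from the slides: a new load starts (and one
# finishes) every 0.83 h, even though a single load takes about 2.1 h.
time_between_starts_h = 0.83
throughput_loads_per_h = 1 / time_between_starts_h
print(round(throughput_loads_per_h, 1))  # 1.2
```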
times three circuit
[Circuit: A feeds an ADD computing A + A = 2 × A, which feeds an ADD computing 2A + A = 3 × A; example input 7 → 14 → 21]
0 ps 50 ps 100 ps
100 ps latency ⇒ 10 results/ns throughput
11
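For the unpipelined circuit, latency is the sum of the chained adder delays, and a new input can only start once the previous result is done; a quick check:

```python
# Two chained 50 ps adders: latency is the sum of the stage delays,
# and with no pipelining one result finishes per full latency.
adder_delay_ps = 50
latency_ps = 2 * adder_delay_ps          # 100 ps
throughput_per_ns = 1000 / latency_ps    # 10 results/ns
print(latency_ps, throughput_per_ns)  # 100 10.0
```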
times three and repeat
[Unpipelined circuit A → ADD (A + A = 2 × A) → ADD (2A + A = 3 × A), fed a new input every 100 ps: 7 → 14 → 21, 17 → 34 → 51, 4 → 8 → 12, 1 → 2 → 3, 23 → 46 → 69]
0 ps 100 ps 200 ps 300 ps 400 ps 500 ps
12
pipelined times three
[Circuit with a pipeline register between the two ADDs; at one instant the registers hold A (t + 2), then A (t + 1) and 2 × A (t + 1), then 3 × A (t + 0): example values 7 → 14 → 21, with 17 entering one stage behind]
13
register tolerances
[Timing diagram: the register output changes shortly after the clock edge (register delay); the register input must not change in a window before the next edge]
14
times three pipeline timing
[Pipelined circuit: two 50 ps ADD stages with 10 ps register delays between them]
10 ps 50 ps 10 ps 50 ps 10 ps
throughput: 1 / 60 ps ≈ 16 G operations/sec
15
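The pipelined throughput calculation (slowest stage plus one register delay per cycle) can be sketched as:

```python
def cycle_time_ps(stage_logic_ps, reg_delay_ps):
    # The clock period covers the slowest stage plus one register delay.
    return max(stage_logic_ps) + reg_delay_ps

def throughput_gops(stage_logic_ps, reg_delay_ps):
    # One result per cycle; 1 result/ps = 1000 G operations/sec.
    return 1000 / cycle_time_ps(stage_logic_ps, reg_delay_ps)

# Two 50 ps ADD stages, 10 ps registers: 60 ps cycle, ~16.7 G ops/sec.
print(round(throughput_gops([50, 50], 10), 1))  # 16.7
```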
deeper pipeline
[Each ADD split across two stages; registers hold A (t + 4), then A (t + 3) and 2 × A partial results, then A (t + 2) and 2 × A (t + 2), then 3 × A partial results, then 3 × A (t + 0)]
10 ps 25 ps 25 ps 10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now? throughput: 1 / 35 ps ≈ 28 G ops/sec
exercise: throughput now? (didn’t split the second add evenly: 30 ps + 20 ps) throughput: 1 / 40 ps ≈ 25 G ops/sec
Problem: How much faster can we get? Problem: Can we even do this?
17
diminishing returns: register delays
logic (all): 100 ps + 10 ps register → 110 ps per cycle
logic split in 2: 50 ps + 10 ps each → 60 ps per cycle
logic split in 3: 33 ps + 10 ps each → 43 ps per cycle
…
logic split very finely: 1 ps + 10 ps each → 11 ps per cycle
18
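The numbers above follow from a one-line model, cycle = logic/n + register delay:

```python
def cycle_ps(total_logic_ps, n_stages, reg_delay_ps=10):
    # Even split: each stage carries logic/n plus one register delay.
    return total_logic_ps / n_stages + reg_delay_ps

for n in (1, 2, 3, 100):
    print(n, round(cycle_ps(100, n), 1))
# 1 110.0
# 2 60.0
# 3 43.3
# 100 11.0
```

The register delay never shrinks, so each extra split buys less than the one before it.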
diminishing returns: register delays
[Plot: time per completion (ps) vs number of stages (2 to 14); the curve flattens toward the register-delay floor, with a 1.83x speedup from the first split but only a 1.02x speedup from the later ones]
19
diminishing returns: register delays
[Plot: throughput (ops/ns) vs number of stages (2 to 14); 1.83x throughput from the first split, only 1.02x from the later ones; the curve approaches the max. rate of register updates]
20
deeper pipeline
[Each ADD split across two stages; registers hold A (t + 4), then A (t + 3) and 2 × A partial results, then A (t + 2) and 2 × A (t + 2), then 3 × A partial results, then 3 × A (t + 0)]
10 ps 25 ps 25 ps 10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now? throughput: 1 / 35 ps ≈ 28 G ops/sec
exercise: throughput now? (didn’t split the second add evenly: 30 ps + 20 ps) throughput: 1 / 40 ps ≈ 25 G ops/sec
Problem: How much faster can we get? Problem: Can we even do this?
22
diminishing returns: uneven split
Can we split up some logic (e.g. adder) arbitrarily? Probably not...
logic (all): 100 ps + 10 ps register → 110 ps per cycle
logic split in 2 (60 ps / 45 ps): slowest piece 60 ps + 10 ps → 70 ps per cycle
logic split in 3 (40 ps / 40 ps / 30 ps): slowest piece 40 ps + 10 ps → 50 ps per cycle
…
23
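With uneven splits the clock follows the slowest piece, not the average; a quick check with the delays above:

```python
def cycle_ps(stage_logic_ps, reg_delay_ps=10):
    # The slowest stage sets the clock, however the logic is divided.
    return max(stage_logic_ps) + reg_delay_ps

print(cycle_ps([100]))         # 110: no split
print(cycle_ps([60, 45]))      # 70: uneven halves, limited by the 60 ps piece
print(cycle_ps([40, 40, 30]))  # 50
```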
textbook SEQ ‘stages’
conceptual order only:
Fetch: read instruction memory
Decode: read register file
Execute: arithmetic (ALU)
Memory: read/write data memory
Writeback: write register file
PC Update: write PC register
writes happen at end of cycle; reads are “magic”, like combinatorial logic, happening as values become available
24
textbook stages
conceptual order only → pipeline stages:
Fetch/PC Update: read instruction memory; compute next PC
Decode: read register file
Execute: arithmetic (ALU)
Memory: read/write data memory
Writeback: write register file
5 stages, one instruction in each
compute the next PC early so the next fetch can start immediately
25
addq CPU
[addq-only datapath with stages labeled fetch and PC update, decode, execute, writeback; PC + 2 (add 2) computes the next PC; one signal skips two stages]
26
pipelined addq processor
[addq datapath divided into stages fetch and PC update, decode, execute, writeback, with pipeline registers fetch/fetch, fetch/decode, decode/execute, and execute/writeback between them]
27
addq execution
[Pipelined addq datapath with registers fetch/fetch, fetch/decode, decode/execute, execute/writeback]
addq %r8, %r9 // (1)
addq %r10, %r11 // (2)
[Snapshot while running: fetch/fetch holds the address of (2); fetch/decode holds reg #s 8, 9 from (1); a cycle later fetch/decode holds reg #s 10, 11 from (2) while decode/execute holds reg # 9 and the values for (1)]
28
addq processor timing
[Pipelined addq datapath with registers fetch/decode, decode/execute, execute/writeback]
// initially %r8 = 800, %r9 = 900, etc.
addq %r8, %r9
addq %r10, %r11
addq %r12, %r13
addq %r9, %r8
cycle | PC  | fetch/decode: rA, rB | decode/execute: R[srcA], R[srcB], dstE | execute/writeback: next R[dstE], dstE
      | 0x0 |                      |                                        |
1     | 0x2 | 8, 9                 |                                        |
2     | 0x4 | 10, 11               | 800, 900, 9                            |
3     | 0x6 | 12, 13               | 1000, 1100, 11                         | 1700, 9
4     |     | 9, 8                 | 1200, 1300, 13                         | 2100, 11
5     |     |                      | 1700, 800, 8                           | 2500, 13
6     |     |                      |                                        | 2500, 8
29
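The cycle-by-cycle behavior can be reproduced with a toy simulation; this is a sketch under the slides' assumptions (register-file writes commit before the next decode reads them, and the instruction spacing avoids any other hazard):

```python
# Toy cycle-by-cycle model of the 4-stage addq pipeline from the slides
# (fetch, decode, execute, writeback). Writebacks land at the start of a
# cycle, before that cycle's decode reads the register file.
regs = {8: 800, 9: 900, 10: 1000, 11: 1100, 12: 1200, 13: 1300}
prog = [(8, 9), (10, 11), (12, 13), (9, 8)]  # addq rA, rB (rB is the destination)

fd = de = ew = None  # pipeline registers between the stages
for cycle in range(1, len(prog) + 4):
    if ew is not None:  # writeback: commit next R[dstE]
        value, dst = ew
        regs[dst] = value
    ew = (de[0] + de[1], de[2]) if de is not None else None  # execute: add
    de = (regs[fd[0]], regs[fd[1]], fd[1]) if fd is not None else None  # decode
    fd = prog[cycle - 1] if cycle - 1 < len(prog) else None  # fetch

print(regs[9], regs[8])  # 1700 2500
```

Note how `addq %r9, %r8` picks up the updated %r9 = 1700: the writeback finishes a cycle before that instruction's decode.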
addq processor performance
[Pipelined addq datapath: PC, Instr. Mem., split, register file, ADD; PC + 2 (add 2) computes the next PC]
example delays:
  add 2: 80 ps
  instruction memory: 200 ps
  register file read: 125 ps
  add: 100 ps
  register file write: 125 ps
no pipelining: 1 instruction per 550 ps
  add up everything but add 2 (critical (slowest) path)
pipelining: 1 instruction per 200 ps + pipeline register delays
  slowest path through a stage + pipeline register delays
  latency: 800 ps + pipeline register delays (4 cycles)
30
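The slide's performance numbers follow directly from the delay table; a quick check (the dictionary keys are just descriptive labels):

```python
# Delays from the slide, in ps; "add 2" (80 ps) runs in parallel and is
# excluded from the critical path.
delays = {"instruction memory": 200, "register file read": 125,
          "add": 100, "register file write": 125}

unpipelined_ps = sum(delays.values())  # 550: everything but add 2
cycle_ps = max(delays.values())        # 200: slowest stage
latency_ps = 4 * cycle_ps              # 800: 4 cycles per instruction
print(unpipelined_ps, cycle_ps, latency_ps)  # 550 200 800
```

Pipeline register delays would add to both the 200 ps cycle and the 800 ps latency, as the slide notes.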