1 last time HCL details/built in components HCL debug/interactive - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

slide-2
SLIDE 2

last time

HCL details/built in components HCL debug/interactive options walkthrough of SEQ stages/needed MUXes

2

slide-3
SLIDE 3

critical path

every path from state output to state input needs enough time

  • output — may change on rising edge of clock

  • input — must be stable sufficiently before rising edge of clock

critical path: slowest of all these paths — determines cycle time

3

slide-4
SLIDE 4

SEQ paths

[Diagram: SEQ datapath — PC, instruction memory (instr. length adder gives PC+9), split into rA/rB/valC, register file (srcA/srcB reads give R[srcA]/R[srcB]; dstE/dstM writes; 0xF and %rsp special cases), ALU (aluA/aluB → valE; add/sub/xor/and chosen by function of instr.; write? chosen by function of opcode), data memory (Data in, Addr in, Data out), ZF/SF and Stat]

  • path 1: 25 picoseconds
  • path 2: 50 picoseconds
  • path 3: 400 picoseconds
  • path 4: 900 picoseconds

…and many, many more paths

4
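The critical-path rule above can be checked in a few lines. A minimal sketch in Python, using the four example path delays from this slide (a real design has many more paths; variable names are mine):

```python
# The clock period must cover the slowest register-to-register path,
# so cycle time is simply the maximum over all path delays.
path_delays_ps = [25, 50, 400, 900]   # ...plus many, many more in a real design

cycle_time_ps = max(path_delays_ps)   # critical path sets the clock
clock_rate_ghz = 1000 / cycle_time_ps # 1000 ps per ns

print(cycle_time_ps)                  # 900
print(round(clock_rate_ghz, 2))       # 1.11
```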


slide-10
SLIDE 10

sequential addq paths

[Diagram: sequential addq datapath — PC, instruction memory, split into rA/rB, register file (srcA/srcB, dstE/dstM, 0xF), an adder for PC + 2 and the ALU add]

  • path 1: 25 picoseconds
  • path 2: 375 picoseconds
  • path 3: 500 picoseconds
  • path 4: 500 picoseconds

overall cycle time: 500 picoseconds (longest path)

5


slide-16
SLIDE 16

Human pipeline: laundry

[Diagram: two laundry schedules on a Washer → Dryer → Folding Table timeline from 11:00 to 14:00 — loads run one after another vs. whites, colors, and sheets overlapping in a pipeline]

6


slide-18
SLIDE 18

Waste (1)

[Diagram: pipelined laundry timeline (whites, colors, sheets) from 11:00 to 14:00, with idle gaps marked “wasted time!”]

7


slide-20
SLIDE 20

Waste (2)

[Diagram: the same pipelined laundry timeline (whites, colors, sheets), highlighting additional idle time between loads]

8

slide-21
SLIDE 21

Latency — Time for One

[Diagram: laundry timeline from 11:00 to 14:00, marking one load's start-to-finish time]

  • pipelined latency: 2.1 h
  • normal latency: 1.8 h

9



slide-25
SLIDE 25

Throughput — Rate of Many

[Diagram: pipelined laundry timeline from 11:00 to 14:00 (whites, colors, sheets)]

  • time between starts: 0.83 h
  • time between finishes: 0.83 h
  • throughput: 1 load / 0.83 h = 1.2 loads/h

10
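The latency and throughput numbers from the laundry slides can be reproduced directly. A small sketch (variable names are mine, not the slides'):

```python
# Throughput is the rate of finishing loads: one per "time between finishes".
time_between_finishes_h = 0.83
throughput_loads_per_h = 1 / time_between_finishes_h
print(round(throughput_loads_per_h, 1))   # 1.2

# Pipelining improved throughput, but latency for one load got slightly worse:
pipelined_latency_h = 2.1   # one load, run through the pipelined schedule
normal_latency_h = 1.8      # one load, run with nothing else in the machine
assert pipelined_latency_h > normal_latency_h
```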



slide-29
SLIDE 29

times three circuit

A

[Diagram: two adders in series — A add A + A gives 2 × A, then add 2A + A gives 3 × A; example input 7 → 14 → 21 along a 0 ps / 50 ps / 100 ps timeline]

100 ps latency ⇒ 10 results/ns throughput

11


slide-31
SLIDE 31

times three and repeat

[Diagram: the circuit reused every 100 ps — inputs 7, 17, 4, 1, 23 produce 2 × A values 14, 34, 8, 2, 46 and results 3 × A = 21, 51, 12, 3, 69 along a 0 ps to 500 ps timeline]

12


slide-33
SLIDE 33

pipelined times three

[Diagram: a pipeline register between the two adders — stage 1 holds A (t + 1) and computes 2 × A (t + 1) while stage 2 computes 3 × A (t + 0); example values: 7 → 14 → 21, with 17 → 34 right behind]

13
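The staged flow above can be mimicked in software. A minimal sketch of a two-stage times-three pipeline, where a single pipeline register carries the previous cycle's partial result (the function name and structure are illustrative only):

```python
# Each "cycle": stage 2 finishes 3*A for the older input while stage 1
# computes 2*A for the newer one; the tuple `reg` is the pipeline register.
def pipelined_times_three(inputs):
    reg = None                # pipeline register between the two adders
    outputs = []
    for a in inputs + [None]: # one extra cycle to drain the pipeline
        if reg is not None:
            a_old, twice_a = reg
            outputs.append(twice_a + a_old)          # stage 2: 2A + A = 3A
        reg = (a, a + a) if a is not None else None  # stage 1: A + A = 2A
    return outputs

print(pipelined_times_three([7, 17, 4]))   # [21, 51, 12]
```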

slide-34
SLIDE 34

register tolerances

[Timing diagram: a register's output changes shortly after the rising clock edge (register delay); its input must not change in the window just before the edge]

14



slide-38
SLIDE 38

times three pipeline timing

[Diagram: pipelined times three — 10 ps register, 50 ps add, 10 ps register, 50 ps add, 10 ps register; A (t + 2) entering, A (t + 1) and 2 × A (t + 1) in flight, 3 × A (t + 0) finishing]

throughput: 1 / 60 ps ≈ 16 G operations/sec

15
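The cycle here must cover one register delay plus one stage of logic. A short sketch recomputing the slide's throughput (names are mine):

```python
# One clock: pay one 10 ps register delay, then one 50 ps adder stage.
register_delay_ps = 10
stage_logic_ps = 50

cycle_ps = register_delay_ps + stage_logic_ps   # 60 ps per cycle
throughput_gops = 1000 / cycle_ps               # results per ns = G ops/sec

print(cycle_ps)                    # 60
print(round(throughput_gops, 1))   # 16.7 (the slide rounds down to 16 G)
```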


slide-42
SLIDE 42

deeper pipeline

[Diagram: the times-three pipeline cut into four stages — A (t + 4) entering; A (t + 3), 2 × A partial results, and 3 × A partial results in flight]

  • stage timing: 10 ps, 25 ps, 10 ps, 25 ps, 10 ps, 25 ps, 10 ps, 25 ps, 10 ps
  • exercise: throughput now? throughput: 1 / 35 ps ≈ 28 G ops/sec
  • uneven variant, 30 ps and 20 ps pieces (didn't split second add evenly); exercise: throughput now?

Problem: How much faster can we get? Problem: Can we even do this?

17

slide-43
SLIDE 43

diminishing returns: register delays

  • logic (all): 100 ps logic + 10 ps register → 110 ps per cycle
  • logic split in 2: 50 ps + 10 ps each → 60 ps per cycle
  • logic split in 3: 33 ps + 10 ps each → 43 ps per cycle
  • …
  • logic split very finely: 1 ps + 10 ps each → 11 ps per cycle

18
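The pattern above is cycle time = (total logic / stages) + register delay. A short sketch reproducing the slide's numbers (the helper name is mine):

```python
# Splitting 100 ps of logic into n equal stages still pays one 10 ps
# register delay per stage, so per-cycle time approaches 10 ps but never
# reaches it.
def per_cycle_ps(total_logic_ps, n_stages, register_delay_ps=10):
    return total_logic_ps / n_stages + register_delay_ps

for n in (1, 2, 3, 100):
    print(n, round(per_cycle_ps(100, n)))   # 110, 60, 43, 11
```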


slide-46
SLIDE 46

diminishing returns: register delays

[Graph: time per completion (ps) vs. number of stages (2 to 14), y-axis 20 to 120 ps, falling toward the register-delay floor; 1.83x speedup from the first split, only 1.02x from a late one]

19


slide-48
SLIDE 48

diminishing returns: register delays

[Graph: throughput (ops/ns) vs. number of stages (2 to 14), y-axis 20 to 100, rising toward the max. rate of register updates; 1.83x throughput from the first split, only 1.02x from a late one]

20
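The ceiling in this graph comes from the register delay alone: with 10 ps registers, updates can never happen more than 100 times per ns. A sketch (the 1.83x figure matches going from one stage to two; the exact late-split ratio depends on which step the graph marks):

```python
# Throughput in ops/ns for 100 ps of logic split into n stages with
# 10 ps registers; it saturates at 1 / (10 ps) = 100 ops/ns.
def throughput_ops_per_ns(n_stages, total_logic_ps=100, register_delay_ps=10):
    return 1000 / (total_logic_ps / n_stages + register_delay_ps)

gain_first = throughput_ops_per_ns(2) / throughput_ops_per_ns(1)
print(round(gain_first, 2))                    # 1.83, nearly double
print(round(throughput_ops_per_ns(1000), 1))   # close to the 100 ops/ns ceiling
```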


slide-51
SLIDE 51

deeper pipeline

[Diagram: the times-three pipeline cut into four stages — A (t + 4) entering; A (t + 3), 2 × A partial results, and 3 × A partial results in flight]

  • stage timing: 10 ps, 25 ps, 10 ps, 25 ps, 10 ps, 25 ps, 10 ps, 25 ps, 10 ps
  • even split — throughput: 1 / 35 ps ≈ 28 G ops/sec
  • uneven split, 30 ps and 20 ps pieces (didn't split second add evenly) — throughput: 1 / 40 ps ≈ 25 G ops/sec

Problem: How much faster can we get? Problem: Can we even do this?

22
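Both answers above follow from one rule: cycle time is the slowest stage's logic plus one register delay. A small sketch (the helper name and stage lists are mine):

```python
# Cycle time = slowest stage + one register delay; an uneven split hurts
# even though the total amount of logic is unchanged.
def cycle_time_ps(stage_logic_ps, register_delay_ps=10):
    return max(stage_logic_ps) + register_delay_ps

even = cycle_time_ps([25, 25, 25, 25])     # 35 ps
uneven = cycle_time_ps([25, 25, 30, 20])   # 40 ps

print(even, round(1000 / even, 1))      # 35 28.6  (~28 G ops/sec)
print(uneven, round(1000 / uneven, 1))  # 40 25.0  (25 G ops/sec)
```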

slide-52
SLIDE 52

diminishing returns: uneven split

Can we split up some logic (e.g. adder) arbitrarily? Probably not...

  • logic (all): 100 ps → 110 ps per cycle
  • uneven split in 2: 60 ps and 45 ps pieces → 70 ps per cycle
  • uneven split in 3: 40 ps, 40 ps, and 30 ps pieces → 50 ps per cycle
  • …

23
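As above, the largest piece plus one register delay sets the cycle, so uneven splits fall short of the ideal even-split numbers. A sketch reproducing the per-cycle times (helper name is mine):

```python
# With an uneven split, the largest piece dominates: splitting further
# helps less and less.
def per_cycle_ps(pieces_ps, register_delay_ps=10):
    return max(pieces_ps) + register_delay_ps

print(per_cycle_ps([100]))          # 110 ps per cycle
print(per_cycle_ps([60, 45]))       # 70 ps, not the ideal 60
print(per_cycle_ps([40, 40, 30]))   # 50 ps, not the ideal 43
```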


slide-55
SLIDE 55

textbook SEQ ‘stages’

conceptual order only:
  • Fetch: read instruction memory
  • Decode: read register file
  • Execute: arithmetic (ALU)
  • Memory: read/write data memory
  • Writeback: write register file
  • PC Update: write PC register

writes happen at end of cycle; reads are “magic”: like combinatorial logic, they happen as values become available

24


slide-58
SLIDE 58

textbook stages

conceptual order only; pipeline stages:
  • Fetch/PC Update: read instruction memory; compute next PC
  • Decode: read register file
  • Execute: arithmetic (ALU)
  • Memory: read/write data memory
  • Writeback: write register file

5 stages

  • one instruction in each stage

  • next PC is computed during fetch so the next fetch can start immediately

25


slide-60
SLIDE 60

addq CPU

[Diagram: addq datapath — PC, instruction memory, split into rA/rB, register file (srcA/srcB, dstE/dstM, 0xF), add 2 for the next PC, and the ALU add — annotated with stages: fetch and PC update, decode, execute, writeback; the writeback signal skips two stages]

26


slide-64
SLIDE 64

pipelined addq processor

[Diagram: the addq datapath with pipeline registers inserted — fetch/decode, decode/execute, execute/writeback, and fetch/fetch (next PC) — separating the fetch and PC update, decode, execute, and writeback stages]

27


slide-68
SLIDE 68

addq execution

[Diagram: pipelined addq datapath with fetch/decode, decode/execute, execute/writeback, and fetch/fetch registers]

addq %r8, %r9   // (1)
addq %r10, %r11 // (2)

while fetch/fetch holds the address of (2), fetch/decode holds reg #s 8, 9 from (1) and then reg #s 10, 11 from (2); decode/execute holds reg # 9 and the values for (1)

28


slide-72
SLIDE 72

addq processor timing

[Diagram: pipelined addq datapath with fetch/decode, decode/execute, and execute/writeback registers]

// initially %r8 = 800, %r9 = 900, etc.
addq %r8, %r9
addq %r10, %r11
addq %r12, %r13
addq %r9, %r8

cycle | fetch: PC | fetch/decode: rA rB | decode/execute: R[srcA] R[srcB], dstE | execute/writeback: next R[dstE], dstE
  1   | 0x0       |                     |                                       |
  2   | 0x2       | 8 9                 |                                       |
  3   | 0x4       | 10 11               | 800 900, 9                            |
  4   | 0x6       | 12 13               | 1000 1100, 11                         | 1700, 9
  5   |           | 9 8                 | 1200 1300, 13                         | 2100, 11
  6   |           |                     | 1700 800, 8                           | 2500, 13
  7   |           |                     |                                       | 2500, 8

29
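No instruction in this sequence reads a register while an older in-flight instruction still owes it a write, so the pipeline produces the same results as running the adds in order. A sketch checking the values the table shows being written back:

```python
# Architectural registers by number; only the ones this program touches.
regs = {8: 800, 9: 900, 10: 1000, 11: 1100, 12: 1200, 13: 1300}
program = [(8, 9), (10, 11), (12, 13), (9, 8)]   # addq %rA, %rB pairs

for src, dst in program:
    regs[dst] = regs[src] + regs[dst]   # addq: rB += rA

print(regs[9], regs[11], regs[13], regs[8])   # 1700 2100 2500 2500
```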


slide-77
SLIDE 77

addq processor performance

[Diagram: addq datapath — PC, instruction memory, split, register file, add 2, ALU add]

example delays:
  path                  time
  add 2                 80 ps
  instruction memory    200 ps
  register file read    125 ps
  add                   100 ps
  register file write   125 ps

no pipelining: 1 instruction per 550 ps — add up everything but add 2 (the critical (slowest) path)

pipelining: 1 instruction per 200 ps + pipeline register delays — slowest path through a stage, plus pipeline register delays; latency: 800 ps + pipeline register delays (4 cycles)

30
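The performance claims above can be recomputed from the delay table. A sketch using the example delays (pipeline register delays are ignored here, since the slide leaves them symbolic):

```python
# Stage delays from the example table; "add 2" runs in parallel and is
# off the critical path, so it is excluded from the sum.
instr_mem_ps, reg_read_ps, add_ps, reg_write_ps = 200, 125, 100, 125

unpipelined_ps = instr_mem_ps + reg_read_ps + add_ps + reg_write_ps
print(unpipelined_ps)          # 550 ps per instruction

pipelined_cycle_ps = max(instr_mem_ps, reg_read_ps, add_ps, reg_write_ps)
print(pipelined_cycle_ps)      # 200 ps per instruction (plus register delays)
print(4 * pipelined_cycle_ps)  # 800 ps latency for one instruction (4 cycles)
```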