MIPS Pipeline with Tomasulo’s Algorithm

SLIDE 1

MIPS Pipeline with Tomasulo’s Algorithm

[Figure: pipeline block diagram. Labeled blocks: IR, Issue, ADD RS, Dispatch, ADD, MULT, DIV, LSQ, MEM, WB, REG FILE, Common Data Bus (CDB)]

SLIDE 2

Example

LOOP:
  A  LD   F0, 0(R1)       | temp = x[i]
  B  MUL  F4, F0, F2      | temp = temp * a
  C  SD   F4, 0(R1)       | x[i] = temp
  D  ADDI R1, R1, #8      | i++
  E  BNE  R1, R2, LOOP    | branch if R1.ne.R2

  • Within an iteration: RAW dependencies between (A, B) on F0 and (B, C) on F4 [focusing on FP registers only]
  • Across iterations: WAR and WAW dependencies become apparent

  A1  LD   F0, 0(R1)       | temp = x[i]
  B1  MUL  F4, F0, F2      | temp = temp * a
  C1  SD   F4, 0(R1)       | x[i] = temp
  D1  ADDI R1, R1, #8      | i++
  E1  BNE  R1, R2, LOOP    | branch if R1.ne.R2
  A2  LD   F0, 0(R1)       | temp = x[i]
  B2  MUL  F4, F0, F2      | temp = temp * a
  C2  SD   F4, 0(R1)       | x[i] = temp
  D2  ADDI R1, R1, #8      | i++
  E2  BNE  R1, R2, LOOP    | branch if R1.ne.R2
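The register dependencies in the unrolled loop can be listed mechanically. The following is a minimal sketch, not part of the original slides: it tracks the last writer and the readers of each FP register and reports RAW, WAR and WAW pairs. The tuple encoding of instructions and the name find_dependencies are assumptions made for illustration, and integer registers are ignored, as on the slide.

```python
# Each entry: (label, destination FP register or None, source FP registers).
# Integer registers (R1, R2) are left out on purpose, as on the slide.
UNROLLED = [
    ("A1", "F0", []),            # LD  F0, 0(R1)
    ("B1", "F4", ["F0", "F2"]),  # MUL F4, F0, F2
    ("C1", None, ["F4"]),        # SD  F4, 0(R1)
    ("A2", "F0", []),            # LD  F0, 0(R1)
    ("B2", "F4", ["F0", "F2"]),  # MUL F4, F0, F2
    ("C2", None, ["F4"]),        # SD  F4, 0(R1)
]


def find_dependencies(instrs):
    """Return (kind, earlier, later, register) tuples in program order."""
    deps = []
    last_writer = {}  # register -> label of its most recent writer
    readers = {}      # register -> labels that read it since that write
    for label, dst, srcs in instrs:
        for reg in srcs:
            if reg in last_writer:
                deps.append(("RAW", last_writer[reg], label, reg))
            readers.setdefault(reg, []).append(label)
        if dst is not None:
            for reader in readers.get(dst, []):
                if reader != label:
                    deps.append(("WAR", reader, label, dst))
            if dst in last_writer:
                deps.append(("WAW", last_writer[dst], label, dst))
            last_writer[dst] = label
            readers[dst] = []
    return deps


deps = find_dependencies(UNROLLED)
for kind, earlier, later, reg in deps:
    print(f"{kind}: {earlier} -> {later} on {reg}")
```

Within one iteration only RAW pairs show up; the WAR and WAW pairs all cross the iteration boundary (for example B1 reads F0 before A2 overwrites it).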

SLIDE 3

Example


[Figure: dependence graph over A1 B1 C1, A2 B2 C2, A3 B3 C3, drawn twice]

Renaming removes WAR and WAW dependencies
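A minimal sketch of why renaming has this effect, under the assumption that every destination gets a fresh name (T1, T2, ... here are hypothetical names; in Tomasulo's algorithm the reservation-station tags play this role). Sources are rewritten to the most recent name of the register they read, so RAW links survive while no name is ever written twice or written after being read.

```python
from itertools import count

# (label, destination FP register or None, source FP registers)
UNROLLED = [
    ("A1", "F0", []),            # LD  F0, 0(R1)
    ("B1", "F4", ["F0", "F2"]),  # MUL F4, F0, F2
    ("C1", None, ["F4"]),        # SD  F4, 0(R1)
    ("A2", "F0", []),
    ("B2", "F4", ["F0", "F2"]),
    ("C2", None, ["F4"]),
]


def rename(instrs):
    """Give every destination a fresh name; sources use the latest mapping."""
    fresh = count(1)
    mapping = {}   # architectural register -> current renamed name
    renamed = []
    for label, dst, srcs in instrs:
        new_srcs = [mapping.get(reg, reg) for reg in srcs]
        new_dst = None
        if dst is not None:
            new_dst = f"T{next(fresh)}"
            mapping[dst] = new_dst
        renamed.append((label, new_dst, new_srcs))
    return renamed


renamed = rename(UNROLLED)
for label, dst, srcs in renamed:
    print(label, dst, srcs)
```

After renaming, B2 reads the name written by A2 and A2 no longer shares a destination with A1, so the second iteration can start without waiting on the first iteration's use of F0 or F4.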

SLIDE 4

In-Order Scheduling of MEM Operations

[Figure: pipeline timing chart. Rows: A1, B1, C1, D1, E1, A2, B2, C2, D2, E2. Columns: cycles 1 to 21. Cell labels: IF, I, D, E, *, M, W]

SLIDE 5

Load Store Queue (LSQ)

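[Figure: LSQ block diagram. Blocks: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. Instruction labels shown: C1 (at the LOAD/STORE Q), A1 (at MEM)]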

  • Since MEM access is in general a multi-cycle operation: buffer several LOAD/STORE instructions.
  • Snapshot at Cycle 4

SLIDE 6

Load Store Queue (LSQ)

[Figure: LSQ block diagram. Blocks: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. Instruction labels shown: C1, A2, C2 (at the LOAD/STORE Q)]

  • Since MEM access is in general a multi-cycle operation: buffer several LOAD/STORE instructions.
  • Snapshot at end of Cycle 10
  • Can I promote A2 ahead of C1 and let it use MEM during cycle 10?

SLIDE 7

Load Store Queue (LSQ)

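[Figure: LSQ block diagram. Blocks: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. Instruction labels shown: C1, C2 (at the LOAD/STORE Q), A2 (at MEM)]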

SLIDE 8

Schedule

[Figure: pipeline timing chart. Rows: LD, MUL, SD, ADD, BEQ for two consecutive iterations. Columns: cycles 1 to 21. Cell labels: IF, I, D, E, *, M, W]

T = 6(n-1) + 11 = 6n + 5 cycles for n iterations
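As a quick check of the cycle-count formula, here is a small sketch that evaluates it. It assumes, as the formula itself suggests, that the first iteration takes 11 cycles and each further iteration adds 6; the function name total_cycles is made up for illustration.

```python
def total_cycles(n, first_iteration=11, per_extra_iteration=6):
    """Cycles for n loop iterations: T = 6(n-1) + 11 = 6n + 5."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return per_extra_iteration * (n - 1) + first_iteration


for n in (1, 2, 10):
    print(n, total_cycles(n))
```

For n = 10 this gives 65 cycles, which matches 6n + 5.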

SLIDE 9

LSQ Features

  • Assume the effective address (ea) of the Load/Store is calculated during Issue and stored as part of the Load or Store buffer in the LSQ
  • The register source of a Store either copies the current register value or waits for an earlier instruction to broadcast its result on the CDB (same as other instructions).

  • ILP affected by size of the LSQ
  • FIFO buffer provides simple implementation for FCFS scheduling
  • Greater concurrency by out-of-order scheduling
  • In example:
  • C1 (Store of iteration 1) does memory write at cycle 11
  • A2 (Load of iteration 2) does memory read at cycle 10
  • C1 was stalled on a RAW hazard, waiting for its data
  • We promoted the later load (A2) ahead of the earlier store (C1)
  • Is this OK?
  • Under what conditions?

SLIDE 10

Example

LOOP:
  A  LD   F0, 0(R1)       | temp = x[i]
  B  MUL  F4, F0, F2      | temp = temp * a
  C  SD   F4, 0(R1)       | x[i] = temp
  D  ADDI R1, R1, #8      | i++
  E  BNE  R1, R2, LOOP    | branch if R1.ne.R2

How about dependencies between memory operations?

[Figure: dependence graph over A1 B1 C1, A2 B2 C2, A3 B3 C3, with candidate memory dependencies labeled WAR, WAW?, RAW?]

Need to ensure these dependencies a) do not exist or b) do not result in memory data hazards.

SLIDE 11

Load Store Queue (LSQ)

[Figure: LSQ block diagram. Blocks: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. Instruction labels shown: B, C (at the LOAD/STORE Q)]

  • Since MEM access is in general a multi-cycle operation: buffer several LOAD/STORE instructions.
  • Question:
  • Is the order of serving memory requests important?

SLIDE 12

Load Store Queue (LSQ)

[Figure: LSQ block diagram. Blocks: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. Instruction labels shown: B, C (at the LOAD/STORE Q)]

  • Since MEM access is in general a multi-cycle operation: buffer several LOAD/STORE instructions
  • An instruction that is ready to execute can be dispatched to MEM while others wait for operands
  • Example 2: Can we promote C (Store) over an earlier waiting store (B)?

    A: MULD F0, F2, F4
    B: SD   F0, 0(R2)
    C: SD   F2, 100(R4)

SLIDE 13

Load Store Queue (LSQ)

[Figure: LSQ block diagram. Blocks: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. Instruction labels shown: B, C, D (at the LOAD/STORE Q)]

  • Example:

    A: MULD F0, F2, F4
    B: SD   F0, 0(R2)
    C: SD   F2, 100(R4)
    D: LD   F6, 200(R6)

  • Can the LD (instruction D) be promoted ahead of B and C?
  • Memory Disambiguation: Need to avoid RAW, WAR, WAW hazards through memory

SLIDE 14

Memory Disambiguation

    A: MULD F0, F2, F4
    B: SD   F0, 0(R2)
    C: SD   F2, 100(R4)
    D: LD   F6, 200(R6)
    E: SD   F8, 100(R2)

  • Instructions may have data dependencies through memory if they read or write the same memory location
  • Since the memory address is known only after the LOAD/STORE is issued, run-time checks are required
  • Memory disambiguation logic in the LSQ is required to see if memory instructions can be executed out of order
  • RAW
  • Newly issued LOAD must be compared for possible RAW hazards against all currently pending STORE instructions

  • When D is issued:
  • Compare effective address of D (LD) against the addresses of pending stores: B and C
  • If no addresses match: safe to promote D and load from memory before the two stores
  • Load Forwarding: Optimization possible for LOAD
  • If there is a pending STORE with address matching that of the issuing LOAD
  • LOAD can obtain its value directly from the Store Buffer without accessing memory
  • Need to choose the closest (most recent) STORE if there are several matching pending STOREs in the queue
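The RAW check and load forwarding described above can be sketched as follows. This is an illustrative model only: the PendingStore record, the issue_load function, its three return cases, and the register contents (R2, R4, R6) are all assumptions, not taken from the slides. The case where the matching store is still waiting for its data is handled here by making the load wait.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingStore:
    label: str
    address: int
    value: Optional[float]  # None until the store data arrives on the CDB


def issue_load(load_address, pending_stores):
    """Decide how a newly issued load is served.

    pending_stores is ordered oldest first. Returns ("memory", None),
    ("forward", value) or ("wait", store_label).
    """
    for store in reversed(pending_stores):  # closest (youngest) store first
        if store.address == load_address:
            if store.value is None:
                return ("wait", store.label)
            return ("forward", store.value)
    return ("memory", None)  # no match: safe to bypass the pending stores


R2, R4, R6 = 1000, 2000, 3000             # hypothetical register contents
pending = [
    PendingStore("B", 0 + R2, None),       # SD F0, 0(R2): F0 still being computed by A
    PendingStore("C", 100 + R4, 2.5),      # SD F2, 100(R4): data already available
]
result_d = issue_load(200 + R6, pending)   # D: LD F6, 200(R6)
print(result_d)
```

With these register values D matches neither store, so it may read memory ahead of B and C. A load from address 2100 would instead take the value 2.5 straight from C's store buffer entry.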

SLIDE 15

Memory Disambiguation

  • WAR
  • STORE must not overtake a LOAD with matching address
  • WAW
  • If an issuing STORE has the same address as a pending STORE
  • If LOAD FORWARDING is being employed:
  • All LOADS from that memory address have already received their value
  • Only purpose of the pending STORE is to update the memory location
  • Cancel (remove from queue) pending STORE
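A minimal sketch of the two store-side rules, under the slide's assumption that load forwarding is in use. The function names, the (label, address) tuple layout and the register contents are hypothetical and chosen only for illustration.

```python
def issue_store(pending_stores, label, address):
    """WAW rule: a newly issued store cancels older pending stores
    to the same address (valid only if loads are served by forwarding).

    pending_stores is a list of (label, address), oldest first.
    Returns (new_pending_list, cancelled_labels).
    """
    kept = [s for s in pending_stores if s[1] != address]
    cancelled = [s[0] for s in pending_stores if s[1] == address]
    kept.append((label, address))
    return kept, cancelled


def store_may_access_memory(store_address, older_pending_load_addresses):
    """WAR rule: a store must not overtake an older load of the same address."""
    return store_address not in older_pending_load_addresses


R2, R4 = 1000, 1000                       # hypothetical: chosen so that C and E alias
pending = [("B", 0 + R2), ("C", 100 + R4)]
pending_after, cancelled = issue_store(pending, "E", 100 + R2)  # E: SD F8, 100(R2)
print(pending_after, cancelled)
```

Here E writes the same location as C, so C is dropped from the queue and only E will update memory. With different contents in R2 and R4 nothing would be cancelled.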

SLIDE 16

Load Store Queue (LSQ)

[Figure: LSQ block diagram with separate STORE Q and LOAD Q. Other blocks: ISSUE, DISPATCH, MEM (shown three times), WRITE, CDB]

  • Useful to give LOADs priority over STOREs
  • Use LOAD FORWARDING to bypass memory
  • WAR to memory (earlier LD completing after later SD) will not occur
  • Use WRITE MERGING to collapse several writes to the same location

SLIDE 17

Implementation Issues with Tomasulo’s Algorithm

  • Instructions tagged with id of Reservation Station they are issued to
  • Unique RS for each issued instruction
  • TAG serves to identify the instruction uniquely
  • Tag the destination register with TAG
  • Tag all source fields of later instructions that match this destination with TAG
  • TAG carried with the instruction to the WB stage so that the completing instruction is identified in the CDB broadcast

  • Results of completing instructions are broadcast on a bus known as Common Data Bus (CDB)
  • REGISTER FILE, RESERVATION STATIONS, LOAD/STORE BUFFERS monitor CDB
  • TAG of completing instruction broadcast along with value
  • Broadcasts on bus must be serialized
  • Distributed Control
  • Registers and RS and LSQ entries self-identify whether or not they are the target of the broadcast by comparing their tag with broadcast TAG

  • Associative Search hardware
  • Centralized Controller
  • Maintains bit vector of destinations for each possible outstanding tag value
  • Routes data broadcast on CDB to units waiting on TAG
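The distributed-control variant of the broadcast can be sketched like this. It is a simplified model under stated assumptions: the class ReservationStation, the function broadcast, the tag names "LSQ1" and "MULT1", and the operand encoding are all invented for illustration, and the real hardware does the comparisons in parallel with associative search rather than in a loop.

```python
class ReservationStation:
    def __init__(self, tag, op, operands):
        self.tag = tag
        self.op = op
        # Each operand is ("value", number) or ("tag", producer_tag).
        self.operands = list(operands)

    def snoop(self, tag, value):
        """Capture the broadcast value in every operand waiting on this tag."""
        self.operands = [
            ("value", value) if operand == ("tag", tag) else operand
            for operand in self.operands
        ]

    def ready(self):
        return all(kind == "value" for kind, _ in self.operands)


def broadcast(tag, value, stations, reg_status, reg_file):
    """One CDB broadcast: every listener compares its own tag with the broadcast tag."""
    for station in stations:
        station.snoop(tag, value)
    for reg, waiting_tag in list(reg_status.items()):
        if waiting_tag == tag:
            reg_file[reg] = value
            del reg_status[reg]   # register no longer waits on a producer


# MUL F4, F0, F2 waits for the load tagged "LSQ1" to produce F0.
mul = ReservationStation("MULT1", "MUL", [("tag", "LSQ1"), ("value", 3.0)])
stations = [mul]
reg_status = {"F0": "LSQ1", "F4": "MULT1"}   # register -> tag of pending producer
reg_file = {"F2": 3.0}
broadcast("LSQ1", 2.0, stations, reg_status, reg_file)
print(mul.ready(), reg_file, reg_status)
```

After the load's broadcast the multiply has both operands and F0 is updated in the register file, while F4 stays tagged with "MULT1" until the multiply itself broadcasts.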

SLIDE 18

Summary of Pipelined Architectures

  Pipeline              Issue      Execute        Write
  Simple 5-stage        In-order   In-order       In-order
  Multi-cycle 5-stage   In-order   In-order       Out-of-order
  Scoreboard            In-order   Out-of-order   Out-of-order
  Tomasulo              In-order   Out-of-order   Out-of-order

  • Maintaining precise interrupts:
  • Complicated when instructions can complete (write) out of order.
  • Earlier instruction may raise interrupt long after later instructions have completed.
  • Solutions:
  • 1. Allow imprecise interrupts (not permitted by IEEE Standard)
  • 2. Stall pipeline as needed to ensure in-order writes (reduces potential parallelism)
  • 3. Trap handlers restore machine to consistent state using saved information (ad hoc, machine-dependent solution)
  • 4. Reorder Buffer:
  • Buffer the results of completing instructions, reorder them, and write them in order
  • Idea of reorder buffer can be used to implement aggressive branch speculation
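The reorder-buffer idea can be sketched in a few lines. This is an illustrative model, not the slides' design: the class ReorderBuffer, its method names and the dictionary entry layout are assumptions. Entries are allocated in program order, may complete in any order, and are written to the register file only from the head of the queue.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # program order, oldest entry on the left

    def allocate(self, label, dest):
        """Reserve an entry at issue time, in program order."""
        entry = {"label": label, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a result; completion may happen out of order."""
        entry["value"] = value
        entry["done"] = True

    def commit(self, reg_file):
        """Write finished results in order, stopping at the first unfinished entry."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                reg_file[entry["dest"]] = entry["value"]
            committed.append(entry["label"])
        return committed


rob = ReorderBuffer()
reg_file = {}
mul = rob.allocate("MUL", "F4")
addi = rob.allocate("ADDI", "R1")

rob.complete(addi, 8)                 # the later instruction finishes first
first_commit = rob.commit(reg_file)   # nothing commits: MUL is still pending
rob.complete(mul, 6.0)
second_commit = rob.commit(reg_file)  # both commit, in program order
print(first_commit, second_commit, reg_file)
```

Because ADDI's result stays in the buffer until MUL has committed, an interrupt raised by MUL would find the register file untouched by any later instruction.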
