MIPS Pipeline with Tomasulo’s Algorithm

SLIDE 1

MIPS Pipeline with Tomasulo’s Algorithm

[Figure: pipeline block diagram; labels: IR, Issue, reservation stations (RS) for ADD, MULT, DIV, LSQ, Dispatch, WB, REG FILE, Common Data Bus (CDB), MEM]

SLIDE 2

Example

LOOP:
A  LD   F0, 0(R1)     | temp = x[i]
B  MUL  F4, F0, F2    | temp = temp * a
C  SD   F4, 0(R1)     | x[i] = temp
D  ADDI R1, R1, #8    | i++
E  BNE  R1, R2, LOOP  | branch if R1.ne.R2

Within an iteration RAW between (A, B) on F0 and (B, C) on F4 [focusing on FP registers only]
Across iterations WAR and WAW dependencies become apparent

A1  LD   F0, 0(R1)     | temp = x[i]
B1  MUL  F4, F0, F2    | temp = temp * a
C1  SD   F4, 0(R1)     | x[i] = temp
D1  ADDI R1, R1, #8    | i++
E1  BNE  R1, R2, LOOP  | branch if R1.ne.R2
A2  LD   F0, 0(R1)     | temp = x[i]
B2  MUL  F4, F0, F2    | temp = temp * a
C2  SD   F4, 0(R1)     | x[i] = temp
D2  ADDI R1, R1, #8    | i++
E2  BNE  R1, R2, LOOP  | branch if R1.ne.R2


SLIDE 3

Example


[Figure: two dependence graphs over instructions A1 B1 C1, A2 B2 C2, A3 B3 C3]

Renaming removes WAR and WAW dependencies
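
The effect of renaming on the loop can be sketched in a few lines. This is a minimal illustration, not the hardware mechanism: each register write receives a fresh name, and later reads use the most recent name. The names T0, T1, ... and the function `rename` are hypothetical; in Tomasulo's algorithm the reservation station tags play this role.

```python
def rename(instructions):
    """Give every register write a fresh name; reads use the latest name.

    Each instruction is a (dest, sources) pair; dest is None for a store.
    """
    latest = {}    # architectural register -> current renamed register
    counter = 0
    renamed = []
    for dest, sources in instructions:
        # Sources are looked up first, so an instruction never reads its own new name.
        new_sources = [latest.get(s, s) for s in sources]
        new_dest = None
        if dest is not None:
            new_dest = f"T{counter}"   # hypothetical fresh name
            counter += 1
            latest[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed

# A: LD F0, 0(R1)   B: MUL F4, F0, F2   C: SD F4, 0(R1)
loop_body = [("F0", ["R1"]), ("F4", ["F0", "F2"]), (None, ["F4", "R1"])]

for dest, sources in rename(loop_body * 2):   # two iterations
    print(dest, sources)
```

After renaming, the second iteration writes T2 and T3 instead of reusing F0 and F4, so only the RAW chains within each iteration remain.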

SLIDE 4

In-Order Scheduling of MEM Operations

[Figure: pipeline timing chart for instructions A1 through E2 over cycles 1 to 21, with memory operations scheduled in order; stage labels IF, I, D, E, M, W, with * marking cells in the B1 and B2 rows]

SLIDE 5

Load Store Queue (LSQ)

[Figure: LSQ block diagram; labels: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. C1 is in the LOAD/STORE Q and A1 is in MEM.]

Since MEM access is in general a multi-cycle operation: buffer several LOAD/STORE instructions.
Snapshot at Cycle 4


SLIDE 6

Load Store Queue (LSQ)

[Figure: LSQ block diagram; labels: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. C1, A2, and C2 are in the LOAD/STORE Q.]

Since MEM access is in general a multi-cycle operation: buffer several LOAD/STORE instructions.
Snapshot at end of Cycle 10
Can I promote A2 ahead of C1 and let it use MEM during cycle 10?


SLIDE 7

Load Store Queue (LSQ)

[Figure: LSQ block diagram; labels: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. C1 and C2 are in the LOAD/STORE Q and A2 is in MEM.]


SLIDE 8

Schedule

[Figure: pipeline timing chart for two loop iterations (LD, MUL, SD, ADD, BEQ) over cycles 1 to 21; stage labels IF, I, D, E, M, W, with * marking cells in the MUL rows]

T = 6(n-1) + 11 = 6n + 5 cycles for n iterations
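
The cycle count T = 6(n-1) + 11 can be checked with a short calculation. The sketch below assumes the reading the formula suggests: the first iteration takes 11 cycles and every further iteration adds 6. The function name and parameter names are hypothetical.

```python
def total_cycles(n, first_iteration=11, per_extra_iteration=6):
    """Cycles for n loop iterations: 6(n-1) + 11, which equals 6n + 5."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return per_extra_iteration * (n - 1) + first_iteration

for n in (1, 2, 10):
    print(n, total_cycles(n))
```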


SLIDE 9

LSQ Features

Assume memory address of the Load/Store (ea) is calculated during Issue and stored as part of the Load or Store buffer in the LSQ

The register source of a Store either copies the current register value or waits for an earlier instruction to broadcast its result on the CDB (same as other instructions).

ILP affected by size of the LSQ
FIFO buffer provides simple implementation for FCFS scheduling
Greater concurrency by out-of-order scheduling
In example:
C1 (Store of iteration 1) does memory write at cycle 11
A2 (Load of iteration 2) does memory read at cycle 10
C1 was stalled on a RAW waiting for data
We promoted the later load (A2) ahead of the earlier store (C1)
Is this OK?
Under what conditions?


SLIDE 10

Example

LOOP:
A  LD   F0, 0(R1)     | temp = x[i]
B  MUL  F4, F0, F2    | temp = temp * a
C  SD   F4, 0(R1)     | x[i] = temp
D  ADDI R1, R1, #8    | i++
E  BNE  R1, R2, LOOP  | branch if R1.ne.R2

How about dependencies between memory operations?

[Figure: instructions A1 B1 C1, A2 B2 C2, A3 B3 C3 with candidate memory dependencies labeled WAR, WAW?, RAW?]

Need to ensure these dependencies a) do not exist or b) do not result in memory data hazards.

SLIDE 11

Load Store Queue (LSQ)

[Figure: LSQ block diagram; labels: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. B and C are in the LOAD/STORE Q.]

Since MEM access is in general a multi-cycle operation: buffer several LOAD/STORE instructions.
Question:
Is the order of serving memory requests important?


SLIDE 12

Load Store Queue (LSQ)

[Figure: LSQ block diagram; labels: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. B and C are in the LOAD/STORE Q.]

Since MEM access is in general a multi-cycle operation: buffer several LOAD/STORE instructions.
An instruction that is ready to execute can be dispatched to MEM while others wait for operands.
Example 2: Can we promote C (Store) over an earlier waiting store (B)?

A: MULD F0, F2, F4
B: SD F0, 0(R2)
C: SD F2, 100(R4)


SLIDE 13

Load Store Queue (LSQ)

[Figure: LSQ block diagram; labels: ISSUE, LOAD/STORE Q, DISPATCH, MEM, WRITE, CDB. B, C, and D are in the LOAD/STORE Q.]

Example:
A: MULD F0, F2, F4
B: SD F0, 0(R2)
C: SD F2, 100(R4)
D: LD F6, 200(R6)

Can one promote the LD (instruction D) in front of B and C?
Memory Disambiguation: Need to avoid RAW, WAR, WAW hazards through memory


SLIDE 14

Memory Disambiguation

A: MULD F0, F2, F4
B: SD F0, 0(R2)
C: SD F2, 100(R4)
D: LD F6, 200(R6)
E: SD F8, 100(R2)

Instructions may have data dependencies through memory if they read or write the same memory location
Since the memory address is only known after the LOAD/STORE is issued, run-time checks are required
Memory disambiguation logic in the LSQ is required to see if memory instructions can be executed out of order
RAW
Newly issued LOAD must be compared for possible RAW hazards against all currently pending STORE instructions
When D is issued:
Compare effective address of D (LD) against the addresses of pending stores: B and C
If no addresses match: safe to promote D and load from memory before the two stores
Load Forwarding: Optimization possible for LOAD
If there is a pending STORE with address matching that of the issuing LOAD
LOAD can obtain its value directly from the Store Buffer without accessing memory
Need to choose the closest STORE if several matching pending STOREs in the queue
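
The RAW check and load forwarding described above can be sketched as follows. This is a minimal model under stated assumptions, not a description of real hardware: `PendingStore` and `check_load` are hypothetical names, the register contents are made up, and the "wait" outcome (the closest matching store has not yet received its data) is an added assumption that the slide does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PendingStore:
    name: str                # instruction label, e.g. "B"
    address: int             # effective address computed at Issue
    value: Optional[float]   # None while the store still waits for its data on the CDB

def check_load(load_address: int, pending_stores: List[PendingStore]):
    """Classify a newly issued load against the pending stores (oldest first).

    ("memory", None)   no address matches: the load may go ahead of the stores
    ("forward", value) the closest matching store already holds its data
    ("wait", name)     the closest matching store is still waiting for data
    """
    for store in reversed(pending_stores):   # youngest first, i.e. the closest store
        if store.address == load_address:
            if store.value is not None:
                return ("forward", store.value)
            return ("wait", store.name)
    return ("memory", None)

# Hypothetical register contents: R2 = 1000, R4 = 2000, R6 = 3000.
stores = [
    PendingStore("B", 1000, None),   # B: SD F0, 0(R2), waiting for MULD
    PendingStore("C", 2100, 3.5),    # C: SD F2, 100(R4), data ready
]
print(check_load(3200, stores))   # D: LD F6, 200(R6) matches no pending store
print(check_load(2100, stores))   # a load from C's address is forwarded
print(check_load(1000, stores))   # a load from B's address must wait
```

Scanning from the youngest store backwards is one simple way to pick the closest matching STORE when several pending stores share the address.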


SLIDE 15

Memory Disambiguation

A: MULD F0, F2, F4
B: SD F0, 0(R2)
C: SD F2, 100(R4)
D: LD F6, 200(R6)
E: SD F8, 100(R2)

Instructions may have data dependencies through memory if they read or write the same memory location
Since the memory address is only known after the LOAD/STORE is issued, run-time checks are required
WAR
STORE must not overtake a LOAD with matching address
WAW
If an issuing STORE has the same address as a pending STORE
If LOAD FORWARDING is being employed:
All LOADS from that memory address have already received their value
Only purpose of the pending STORE is to update the memory location
Cancel (remove from queue) pending STORE
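
The WAR and WAW rules for an issuing STORE can be sketched in the same style. This is an illustrative model only: `MemOp` and `issue_store` are hypothetical names, the register contents are made up, and the sketch conservatively cancels an older store only when no pending load uses the same address, which is an assumption added here to keep the example safe.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MemOp:
    name: str      # instruction label, e.g. "B"
    kind: str      # "LD" or "SD"
    address: int   # effective address computed at Issue

def issue_store(new_store: MemOp, queue: List[MemOp]) -> Tuple[List[str], List[str]]:
    """Insert a store into the queue (oldest first).

    Returns (must_follow, cancelled):
      must_follow  older loads with the same address (WAR): the store must not overtake them
      cancelled    older stores with the same address (WAW) removed from the queue
    """
    must_follow = [op.name for op in queue
                   if op.kind == "LD" and op.address == new_store.address]
    cancelled: List[str] = []
    if not must_follow:   # assumption: cancel only when no load still needs that location
        cancelled = [op.name for op in queue
                     if op.kind == "SD" and op.address == new_store.address]
        queue[:] = [op for op in queue if op.name not in cancelled]
    queue.append(new_store)
    return must_follow, cancelled

# Hypothetical register contents: R2 = 1000, R4 = 1000, R6 = 3000.
queue = [
    MemOp("B", "SD", 1000),   # B: SD F0, 0(R2)
    MemOp("C", "SD", 1100),   # C: SD F2, 100(R4)
    MemOp("D", "LD", 3200),   # D: LD F6, 200(R6)
]
print(issue_store(MemOp("E", "SD", 1100), queue))   # E: SD F8, 100(R2)
print([op.name for op in queue])
```

With these made-up register values, E and C write the same location, so the pending store C is cancelled and only E's value reaches memory.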


SLIDE 16

Load Store Queue (LSQ)

[Figure: LSQ block diagram with separate STORE Q and LOAD Q; labels: ISSUE, DISPATCH, MEM, WRITE, CDB]

Useful to give LOADs priority over STOREs
Use LOAD FORWARDING to bypass memory
WAR to memory (earlier LD completing after later SD) will not occur
Use WRITE MERGING to collapse several writes to the same location


SLIDE 17

Implementation Issues with Tomasulo’s Algorithm

Instructions tagged with id of Reservation Station they are issued to
Unique RS for each issued instruction
TAG serves to identify the instruction uniquely
Tag the destination register with TAG
Tag all source fields of later instructions that match this destination with TAG
TAG carried with the instruction to the WB stage so that the completing instruction is identified in the CDB broadcast

Results of completing instructions are broadcast on a bus known as Common Data Bus (CDB)
REGISTER FILE, RESERVATION STATIONS, LOAD/STORE BUFFERS monitor CDB
TAG of completing instruction broadcast along with value
Broadcasts on bus must be serialized
Distributed Control
Registers and RS and LSQ entries self-identify whether or not they are the target of the broadcast by comparing their tag with broadcast TAG

Associative Search hardware
Centralized Controller
Maintains bit vector of destinations for each possible outstanding tag value
Routes data broadcast on CDB to units waiting on TAG
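
The distributed-control variant of the CDB broadcast can be sketched as follows. This is a software illustration under assumptions, not the hardware design: `Operand`, `ReservationStation`, and `broadcast` are hypothetical names, and the sequential loops stand in for what the associative search hardware does in parallel.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Operand:
    tag: Optional[str] = None       # RS id of the producer, or None when the value is ready
    value: Optional[float] = None

@dataclass
class ReservationStation:
    name: str                       # the RS id doubles as the instruction's TAG
    sources: List[Operand] = field(default_factory=list)

    def ready(self) -> bool:
        return all(src.tag is None for src in self.sources)

def broadcast(tag: str, value: float,
              stations: List[ReservationStation],
              register_tags: Dict[str, Optional[str]],
              register_values: Dict[str, float]) -> None:
    """One CDB broadcast: every waiting entry compares its tag with the broadcast TAG."""
    for rs in stations:
        for src in rs.sources:
            if src.tag == tag:
                src.tag, src.value = None, value
    for reg, pending in register_tags.items():
        if pending == tag:
            register_values[reg] = value
            register_tags[reg] = None   # the register no longer waits on any tag

# LD (in station "LD1") produces F0; MUL (in station "MUL1") waits for it.
mul = ReservationStation("MUL1", [Operand(tag="LD1"), Operand(value=2.0)])
register_tags = {"F0": "LD1", "F4": "MUL1"}
register_values = {"F0": 0.0, "F2": 2.0, "F4": 0.0}

print(mul.ready())
broadcast("LD1", 5.0, [mul], register_tags, register_values)
print(mul.ready(), register_values["F0"], register_tags)
```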


SLIDE 18

Summary of Pipelined Architectures

Pipeline             Issue     Execute       Write
Simple 5-stage       In-order  In-order      In-order
Multi-cycle 5-stage  In-order  In-order      Out-of-order
Scoreboard           In-order  Out-of-order  Out-of-order
Tomasulo             In-order  Out-of-order  Out-of-order

Maintaining precise interrupts:
Complicated when instructions can complete (write) out of order.
Earlier instruction may raise interrupt long after later instructions have completed.
Solutions:
1. Allow imprecise interrupts: not permitted by IEEE Standard
2. Stall pipeline as needed to ensure in-order writes: reduces potential parallelism
3. Trap handlers restore machine to consistent state using saved information: ad hoc (machine-dependent) solution
4. Reorder Buffer:
Buffer the results of completing instructions, reorder them, and write them in order
Idea of reorder buffer can be used to implement aggressive branch speculation
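
The reorder buffer idea can be sketched as a queue kept in program order. This is a minimal illustration under assumptions, not a full design: the class `ReorderBuffer` and its method names are hypothetical, and exceptions and branch speculation are left out.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order, oldest instruction at the left

    def allocate(self, name, dest):
        """Reserve an entry at issue time, in program order."""
        entry = {"name": name, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a result; instructions may complete in any order."""
        entry["value"], entry["done"] = value, True

    def commit(self, registers):
        """Write results to the registers from the head only; return retired names."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            registers[entry["dest"]] = entry["value"]
            retired.append(entry["name"])
        return retired

rob = ReorderBuffer()
registers = {"F4": 0.0, "R1": 0}
mul = rob.allocate("MUL", "F4")
addi = rob.allocate("ADDI", "R1")

rob.complete(addi, 8)            # the later instruction finishes first
print(rob.commit(registers))     # nothing retires while MUL is still outstanding
rob.complete(mul, 2.5)
print(rob.commit(registers))     # both now retire, in program order
```

Because register writes happen only at the head of the buffer, an interrupt raised by MUL would find R1 still unmodified by the later ADDI.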
