
SLIDE 1

CSCI-2500: Computer Organization

Processor Design

SLIDE 2

CSCI-2500 SPRING 2016, Processor Design, Chapter 4

Datapath

- The datapath is the interconnection of the components that make up the processor.
- The datapath must provide connections for moving bits between memory, registers and the ALU.

SLIDE 3

Control

- The control is a collection of signals that enable/disable the inputs/outputs of the various components.
- You can think of the control as the brain, and the datapath as the body.
  - the datapath does only what the brain tells it to do.

SLIDE 4

Processor Design

The sequencing and execution of instructions

- We already know about many of the individual components that are necessary:
  - ALU, Multiplexors, Decoders, Flip-Flops
- We need to discuss how to use a clock.
- We need to think about registers and memory.

SLIDE 5


The Clock

The clock generates a never-ending sequence of alternating 1s and 0s. All operations are synchronized to the clock.

SLIDE 6

Clocking Methodology

- Determines when (relative to the clock) a signal can be read and written.
- Read: a signal value is used by some component.
- Written: a signal value is generated by some component.

SLIDE 7

Simple Example: Enabled AND

- We want an AND gate that holds its output value constant until the clock switches from 0 (lo) to 1 (hi).
- We can use a flip-flop to hold the inputs to the AND gate constant during the time we want the output constant.
- We use a clocked flip-flop to make things happen when the clock changes.

SLIDE 8

D Flip-Flop Reminder

[Figure: D flip-flop with data input D, a Clock input, and output Q]

The output (Q) changes to reflect D only when the Clock is a 1.

SLIDE 9

D Flip-Flop Timing

[Figure: timing diagram of the D, C (clock), and Q signals]

SLIDE 10

Clocked AND gate

[Figure: inputs A and B each pass through a D flip-flop driven by a shared Clock; the two Q outputs feed an AND gate that produces A•B (clocked)]

SLIDE 11

Edge-triggered Clocking

- Values stored are updated (can change) only on a clock edge.
- When the clock switches from 0 to 1, everybody allows signals in.
  - everybody means state elements
  - combinational elements always do the same thing; they don’t care about the clock (that’s why we added the flip-flops to our AND gate).
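The clocked AND gate can be sketched in a few lines of Python. This is only a behavioral model under the assumption of rising-edge-triggered flip-flops; the names `DFlipFlop`, `tick`, and `clocked_and` are illustrative and do not come from the slides.

```python
class DFlipFlop:
    """Edge-triggered D flip-flop: Q takes the value of D only on a 0 -> 1 clock edge."""

    def __init__(self):
        self.q = 0
        self._prev_clock = 0

    def tick(self, d, clock):
        if self._prev_clock == 0 and clock == 1:  # rising edge
            self.q = d
        self._prev_clock = clock
        return self.q


def clocked_and(ff_a, ff_b, a, b, clock):
    # The AND gate is combinational; the flip-flops hold its inputs steady.
    return ff_a.tick(a, clock) & ff_b.tick(b, clock)


ff_a, ff_b = DFlipFlop(), DFlipFlop()
outputs = [
    clocked_and(ff_a, ff_b, 1, 1, 0),  # clock low: inputs not captured yet
    clocked_and(ff_a, ff_b, 1, 1, 1),  # rising edge: capture A=1, B=1
    clocked_and(ff_a, ff_b, 0, 1, 1),  # clock stays high: change on A ignored
    clocked_and(ff_a, ff_b, 0, 1, 0),  # falling edge: still ignored
    clocked_and(ff_a, ff_b, 0, 1, 1),  # next rising edge: capture A=0, B=1
]
print(outputs)
```

The output only changes on the two rising edges, which is the behavior the state elements in the rest of the deck rely on.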

SLIDE 12

State Elements

- Any component that stores one or more values is a state element.
- The entire processor can be viewed as a circuit that moves from one state (collection of all the state elements) to another state.
- At time i a component uses values generated at time i-1.

SLIDE 13

Register File

[Figure: register file with inputs Read register number 1, Read register number 2, Write register, Write data, and Write; outputs Read data 1 and Read data 2]

- Contains multiple registers
  - each holds 32 bits
- Two output ports (read ports)
- One input port (write port)
- To change the value of a register:
  - supply register number
  - supply data
  - clock (the Write control signal)
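As a rough behavioral sketch of the register file just described (two read ports, one write port gated by the Write signal), assuming 32 registers of 32 bits each. The class and method names are made up for illustration, and the model ignores details such as MIPS register 0 being hard-wired to zero.

```python
class RegisterFile:
    """Behavioral model: two read ports, one write port."""

    def __init__(self, count=32):
        self.regs = [0] * count

    def read(self, read_reg1, read_reg2):
        # Both read ports deliver their data at the same time.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write_reg, write_data, write):
        # A register changes only when the Write control signal is asserted.
        if write:
            self.regs[write_reg] = write_data & 0xFFFFFFFF  # keep 32 bits


rf = RegisterFile()
rf.clock_edge(write_reg=9, write_data=42, write=True)
rf.clock_edge(write_reg=10, write_data=7, write=False)  # ignored: Write is 0
print(rf.read(9, 10))
```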

SLIDE 14

Implementation of Read Ports

Figure B.19

[Figure: Read register number 1 and Read register number 2 each select, through a multiplexor, one of Register 0 through Register n; the multiplexor outputs are Read data 1 and Read data 2]

SLIDE 15

Implementation of Write

[Figure: write-port implementation with an n-to-1 decoder on the Register number input, the Write signal, and Register data driving the C and D inputs of Register 0 through Register n]

SLIDE 16

Memory

- Memory is similar to a very large register file:
  - single read port (output)
  - chip select input signal
  - output enable input signal
  - write enable input signal
  - address lines (determine which memory element)
  - data input lines (used to write a memory element)

SLIDE 17

4 x 2 Memory (SRAM)

[Figure: 4 x 2 SRAM built from eight D latches (inputs D, C, and Enable; output Q), a 2-to-4 decoder driven by Address, a Write enable signal, data inputs Din[1] and Din[0], and data outputs Dout[1] and Dout[0]]

SLIDE 18

Memory Usage

- For now, we treat memory as a single component that supports 2 operations:
  - write (we change the value stored in a memory location)
  - read (we get the value currently stored in a memory location).
- We can only do one operation at a time!

SLIDE 19

Instruction & Data Memory

- It is useful to treat the memory that holds instructions as a separate component.
  - instruction memory is read-only
- Typically there is really one memory that holds both instructions and data.
  - as we will see when we talk more about memory, the processor often has two interfaces to the memory, one for instructions and one for data!

SLIDE 20

Designing a Datapath for MIPS

- We start by looking at the datapaths needed to support a simple subset of MIPS instructions:
  - a few arithmetic and logical instructions
  - load and store word
  - beq and j instructions

SLIDE 21

Functions for MIPS Instructions

- We can generalize the functions we need to:
  - Using the PC register as the address, read a value from the memory (read the instruction).
  - Read one or two register values (depends on the specific instruction).
  - ALU operation, memory read or write, …
  - Possibly change the value of a register.

SLIDE 22

Fetching the next instruction

- PC Register holds the address
- Memory holds the instruction
  - we need to read from memory.
- Need to update the PC
  - add 4 to current value
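The fetch step can be modeled as a single function. This is a sketch, assuming the instruction memory is a Python dict keyed by byte address; the function name `fetch`, the addresses, and the memory contents are hypothetical placeholders.

```python
def fetch(instruction_memory, pc):
    """Read the instruction addressed by the PC and compute the next PC."""
    instruction = instruction_memory[pc]  # instruction memory is only read
    next_pc = (pc + 4) & 0xFFFFFFFF       # dedicated adder: PC + 4
    return instruction, next_pc


# Hypothetical contents: two instruction words at made-up addresses.
imem = {0x00400000: 0x014B4820, 0x00400004: 0x8D490004}
instruction, pc = fetch(imem, 0x00400000)
print(hex(instruction), hex(pc))
```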

SLIDE 23

Instruction Fetch DataPath

[Figure: the PC drives the Read address input of the Instruction memory, which outputs the Instruction; an Add unit adds 4 to the PC]

SLIDE 24

Supporting R-format instructions

Includes add, sub, slt, and & or instructions. Generalization:

- read 2 registers and send to ALU.
- perform ALU operation
- store result in a register

op      rs      rt      rd      shamt   funct
6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
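To make the field layout concrete, here is a small sketch that splits a 32-bit instruction word into the six R-format fields with shifts and masks. The function name is illustrative; the example word encodes `add $t1, $t2, $t3` under the standard MIPS register numbering ($t1 = 9, $t2 = 10, $t3 = 11), which the slides do not spell out.

```python
def decode_r_format(instruction):
    """Split a 32-bit R-format instruction word into its six fields."""
    return {
        "op": (instruction >> 26) & 0x3F,    # bits 31-26
        "rs": (instruction >> 21) & 0x1F,    # bits 25-21
        "rt": (instruction >> 16) & 0x1F,    # bits 20-16
        "rd": (instruction >> 11) & 0x1F,    # bits 15-11
        "shamt": (instruction >> 6) & 0x1F,  # bits 10-6
        "funct": instruction & 0x3F,         # bits 5-0
    }


fields = decode_r_format(0x014B4820)  # add $t1, $t2, $t3
print(fields)
```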

SLIDE 25

MIPS Registers

- MIPS has 32 general purpose registers.
- Register File holds all 32 registers
  - need 5 bits to select a register
  - rs, rt & rd fields in R-format instructions.
- MIPS Register File has 2 read ports.
  - can get at both source registers at the same time.

SLIDE 26

Datapath for R-format Instructions

[Figure: the Instruction supplies Read register 1, Read register 2, and Write register to the Registers block; Read data 1 and Read data 2 feed the ALU, which is steered by the 3-bit ALU operation control and produces Zero and ALU result; the ALU result returns to Write data, with RegWrite enabling the write]

SLIDE 27

Load and Store Instructions

Need to compute the address

- offset (part of the instruction)
- base (stored in a register).

For Load:

- read from memory
- store in a register

For Store:

- read from register
- write to memory

SLIDE 28

Computing the address

- 16 bit signed offset is part of the instruction.
- We have a 32 bit ALU.
  - need to sign extend the offset (to 32 bits).
- Feed the 32 bit offset and the contents of a register to the ALU
- Tell the ALU to “add”.
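The address computation for lw/sw can be sketched as follows. This assumes 32-bit wraparound arithmetic; the function names and the example base address are illustrative only.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a 32-bit two's-complement bit pattern."""
    value &= 0xFFFF
    if value & 0x8000:          # sign bit set: fill the upper 16 bits with 1s
        return value | 0xFFFF0000
    return value


def effective_address(base, offset16):
    """Base register contents plus sign-extended offset, as the ALU would add them."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


print(hex(effective_address(0x10010000, 8)))       # positive offset
print(hex(effective_address(0x10010000, 0xFFFC)))  # offset field encodes -4
```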

SLIDE 29

Load/Store Datapath

[Figure: load/store datapath. The Registers block (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2) feeds the ALU; the 16-bit offset from the Instruction passes through Sign extend to become a 32-bit ALU input; the ALU result is the Address of the Data memory (Write data, Read data). Control signals shown: RegWrite, ALU operation (3 bits), MemRead, MemWrite.]

SLIDE 30

Supporting beq

- 2 registers compared for equality
- 16 bit offset used to compute target address.
  - signed offset is relative to the PC
  - offset is in words not in bytes!
- Might branch, might not (need to decide).

SLIDE 31

Computing target address

- Recall that the offset is actually relative to the address of the next instruction.
  - we always add 4 to the PC, so we must make sure we use this value as the base.
- Word vs. Byte offset
  - we just need to shift the 16 bit offset 2 bits to the left (fill with 2 zeros).
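A short sketch of the target computation: converting the word offset to a byte offset is a multiplication by 4, i.e. a left shift by 2 bits, and the base is PC + 4. The function name and example addresses are hypothetical.

```python
def branch_target(pc, offset16):
    """Target of a beq located at address pc with a 16-bit word offset."""
    offset = offset16 & 0xFFFF
    if offset & 0x8000:
        offset -= 0x10000  # interpret the field as a signed value
    # Word offset -> byte offset (shift left 2), relative to PC + 4.
    return (pc + 4 + (offset << 2)) & 0xFFFFFFFF


print(hex(branch_target(0x00400010, 3)))       # forward 3 instructions
print(hex(branch_target(0x00400010, 0xFFFF)))  # offset -1: branch to itself
```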

SLIDE 32

Branch Datapath

[Figure: branch datapath. The Registers block feeds the ALU, whose Zero output goes to the branch control logic; the 16-bit offset from the Instruction passes through Sign extend (16 to 32 bits) and Shift left 2, and an Add unit sums it with PC + 4 from the instruction datapath to produce the Branch target. Control signals shown: RegWrite, ALU operation (3 bits).]

SLIDE 33

Control & DataPath

Ref: Chapter 4


SLIDE 36

Datapaths

We looked at individual datapaths that support:

1. Fetching Instructions
2. Arithmetic/Logical Instructions
3. Load & Store Instructions
4. Conditional branch

We need to combine these into a single datapath.

SLIDE 37

Issues

- When designing one datapath that can be used for any operation:
  - the goal is to be able to handle one instruction per cycle.
  - must make sure no datapath resource needs to be used more than once at the same time.
    - if so, we need to provide more than one!

SLIDE 38

Sharing Resources

- We can share datapath resources by adding a multiplexor (and a control line).
  - for example, the second input to the ALU could come from either:
    - a register (as in an arithmetic instruction)
    - the instruction (as in a load/store, when computing the memory address).
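A two-input multiplexor is easy to model. The sketch below assumes the usual textbook convention that a select value of 0 picks the register operand and 1 picks the sign-extended offset; the function names are illustrative.

```python
def mux2(select, in0, in1):
    """2-to-1 multiplexor: pass in0 when select is 0, in1 when select is 1."""
    return in1 if select else in0


def alu_second_operand(alu_src, read_data2, sign_extended_offset):
    # One shared ALU input, two possible sources, one control line.
    return mux2(alu_src, read_data2, sign_extended_offset)


print(alu_second_operand(0, 5, 100))  # arithmetic instruction: register value
print(alu_second_operand(1, 5, 100))  # load/store: offset from the instruction
```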

SLIDE 39

Sharing with a Multiplexor Example

[Figure: a multiplexor driven by Control selects B or C as Operand 2 of an ADD unit whose Operand 1 is A; the output is A+B when Control==0 and A+C when Control==1]

SLIDE 40

Combining Datapaths for memory instructions and arithmetic instructions

- Need to share the ALU
  - For memory instructions, used to compute the address in memory.
  - For arithmetic/logical instructions, used to perform the arithmetic/logical operation.

SLIDE 41

[Figure: combined datapath for memory and arithmetic instructions. The Registers block, Sign extend unit (16 to 32 bits), ALU, and Data memory are joined by two multiplexors, controlled by ALUSrc and MemtoReg. Control signals shown: RegWrite, ALU operation (3 bits), MemRead, MemWrite, ALUSrc, MemtoReg. Callouts: Sharing Multiplexors; New Controls.]

SLIDE 42

Adding the Instruction Fetch

- One memory for instructions, separate memory for data.
  - otherwise we might need to use the memory twice in the same instruction.
- Dedicated Adder for updating the PC
  - otherwise we might need to use the ALU twice in the same instruction.

SLIDE 43

[Figure: the combined datapath with instruction fetch added: the PC, the Instruction memory, and an Add unit that adds 4 to the PC, alongside the Registers block, Sign extend unit, ALU, Data memory, and the ALUSrc and MemtoReg multiplexors. Callouts: Two Memory Units; Dedicated Adder.]

SLIDE 44

Need to add datapath for beq

- Register comparison (requires ALU).
- Another adder to compute target address.
  - One input to adder is sign extended offset, shifted by 2 bits.
  - Other input to adder is PC+4

SLIDE 45

[Figure: the datapath with beq support added: a Shift left 2 unit and a second Add unit compute the branch target, and a multiplexor controlled by PCSrc selects the next PC. Control signals shown: ALU operation (3 bits), RegWrite, MemRead, MemWrite, PCSrc, ALUSrc, MemtoReg. Callout: New adder and mux.]

SLIDE 46

Whew!

- Keep in mind that the datapath we now have supports just a few MIPS instructions!
- Things get worse (more complex) as we support other instructions:
  - j, jal, jr, addi
- We won’t worry about them now…

SLIDE 47

Control Unit

- We need something that can generate the controls in the datapath.
- Depending on what kind of instruction we are executing, different controls should be turned on (asserted) and off (deasserted).
- We need to treat each control individually (as a separate boolean function).

SLIDE 48

Controls

- Our datapath includes a bunch of controls:
  - ALU operation (3 bits)
  - RegWrite
  - ALUSrc
  - MemWrite
  - MemtoReg
  - MemRead
  - PCSrc

SLIDE 49

ALU Operation Control

- A 3 bit control (assumes the ALU designed in chapter 4):

ALU Control Input   Operation
000                 AND
001                 OR
010                 add
110                 subtract
111                 slt

SLIDE 50

ALU Functions for other instructions

lw, sw (load/store): addition
beq: subtraction
add, sub, and, or, slt (arithmetic/logical): all R-format instructions

SLIDE 51

R-Format Instructions

op      rs      rt      rd      shamt   funct
6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

Operation is specified by some bits in the funct field in the instruction.

SLIDE 52

MIPS Instruction OPCODEs

- The most significant 6 bits are an OPCODE that identifies the instruction.
- R-Format: always 000000
  - (funct identifies the operation)

Instruction   op
lw            100011
sw            101011
beq           000100

[Figure: instruction word with a 6-bit op field followed by bits that vary depending on the instruction]

SLIDE 53

Generating ALU Controls

We can view the 3 bit ALU control as 3 boolean functions. Inputs are:

- the op field (OPCODE)
- funct field (for R-format instructions only)

SLIDE 54

Simplifying The Opcode

For building the ALU Operation Controls, we are interested in only 4 different opcodes.

We can simplify things by first reducing the 6 bit op field to a 2 bit value we will call ALUOp.

SLIDE 55

Instruction  ALUOp  funct   ALU action  ALU controls
lw           00     ??????  add         010
sw           00     ??????  add         010
beq          01     ??????  subtract    110
add          10     100000  add         010
sub          10     100010  subtract    110
and          10     100100  and         000
or           10     100101  or          001
slt          10     101010  slt         111

SLIDE 56

Build a Truth Table

- We can now build a truth table for the 3 bit ALU control.
- Inputs are:
  - 2 bit ALUOp
  - 6 bit funct field
- Abbreviated Truth Table: only show the rows we care about!

SLIDE 57

ALUOp     funct          ALU Control
0 0       x x x x x x    010
x 1       x x x x x x    110
1 x       x x 0 0 0 0    010
1 x       x x 0 0 1 0    110
1 x       x x 0 1 0 0    000
1 x       x x 0 1 0 1    001
1 x       x x 1 0 1 0    111

x means “don’t care”
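The abbreviated truth table translates almost directly into a lookup. The sketch below is one possible software rendering, not the gate-level logic; the function name and the use of a dict are illustrative choices. Only the low four funct bits are examined, mirroring the don't-care columns.

```python
# ALU control values from the 3-bit encoding used in this deck.
AND, OR, ADD, SUB, SLT = 0b000, 0b001, 0b010, 0b110, 0b111

# Low four funct bits -> ALU control, for R-format instructions.
FUNCT_TO_CONTROL = {0b0000: ADD, 0b0010: SUB, 0b0100: AND, 0b0101: OR, 0b1010: SLT}


def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and 6-bit funct field to the 3-bit ALU control."""
    if alu_op == 0b00:   # lw, sw: add to form the address
        return ADD
    if alu_op & 0b01:    # beq (row "x 1"): subtract to compare
        return SUB
    return FUNCT_TO_CONTROL[funct & 0b1111]  # R-format (row "1 x")


print(format(alu_control(0b10, 0b101010), "03b"))  # slt
```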

SLIDE 58

Adding the ALU Control

- We can now add the ALU control to the datapath:
  - inputs to this control come from the instruction and from ALUOp
- If we try to show all the details the picture becomes too complex:
  - just plop in an “ALU Control” box.

SLIDE 59

[Figure: datapath with the ALU control unit added. Instruction [25-21] and Instruction [20-16] feed Read register 1 and Read register 2; a multiplexor controlled by RegDst chooses between Instruction [20-16] and Instruction [15-11] for the Write register; Instruction [15-0] feeds Sign extend; Instruction [5-0] and ALUOp feed ALU control. Other signals shown: RegWrite, ALUSrc, MemRead, MemWrite, MemtoReg, PCSrc. Callouts: Shows which bits from the instruction are fed to register file inputs; ALU Control.]

SLIDE 60

Implementing Other Controls

- The other controls in our datapath must also be specified as functions.
- We need to determine the inputs to all the functions.
  - primarily the inputs are part of the instructions, but there are exceptions.
- Need to define precisely what conditions should turn on each control.

SLIDE 61

RegDst Control Line

- Controls a multiplexor that selects one of the fields rt or rd from an R-format or I-format instruction.
- I-Format is used for load and store.
- lw needs to write to the register rt.

R-format:  op | rs | rt | rd | shamt | funct
I-format:  op | rs | rt | address

SLIDE 62

RegDst usage

- RegDst should be
  - 0 to send rt to the write register # input.
  - 1 to send rd to the write register # input.
- RegDst is a function of the opcode field:
  - If instruction is lw, RegDst should be 0.
  - For R-format instructions RegDst should be 1.
  - sw and beq do not write a register, so RegDst is a don’t care for them.

SLIDE 63

RegWrite Control

- A 1 tells the register file to write a register.
  - whatever register is specified by the write register # input is written with the data on the write register data inputs.
- Should be a 1 for arithmetic/logical instructions and for a load.
- Should be a 0 for store or beq.

SLIDE 64

ALUSrc Control

- MUX that selects the source for the second ALU operand.
  - 0 means select the second register file output (read data 2).
  - 1 means select the sign-extended 16 bit offset (part of the instruction).
- Should be a 1 for load and store.
- Should be a 0 for everything else.

SLIDE 65

MemRead Control

A 1 tells the memory to put the contents of the memory location (specified by the address lines) on the Read data output.

Should be a 1 for load.
Should be a 0 for everything else.

SLIDE 66

MemWrite Control

1 means that memory location (specified by memory address lines) should get the value specified on the memory Write Data input.

Should be a 1 for store.
Should be a 0 for everything else.

SLIDE 67

MemToReg Control

MUX that selects the value to be stored in a register (that goes to register write data input).

- 1 means select the value coming from the memory data output.
- 0 means select value coming from the ALU output.

Should be a 1 for load.
Should be a 0 for arithmetic/logical instructions.
Don’t care for everything else (sw, beq), since those instructions do not write a register.

SLIDE 68

PCSrc Control

MUX that selects the source for the value written into the PC register.

- 1 means select the output of the Adder used to compute the relative address for a branch.
- 0 means select the output of the PC+4 adder.

Should be a 1 for beq if registers are equal!
Should be a 0 for other instructions or if registers are different.

SLIDE 69

PCSrc depends on result of ALU operation!

- This control line can’t be simply a function of the instruction (all the others can).
- PCSrc should be a 1 only when:
  - beq AND ALU zero output is a 1
- We will generate a signal called “branch” that we can AND with the ALU zero output.

SLIDE 70

Truth Table for Control

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp
R-format     1       0       0         1         0        0         0       10
lw           0       1       1         1         1        0         0       00
sw           x       1       x         0         0        1         0       00
beq          x       0       x         0         0        0         1       01
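The control truth table can be written down as a lookup keyed by opcode, with PCSrc derived separately by ANDing Branch with the ALU Zero output. This is a sketch only: the dict layout and function names are illustrative, don't-care entries are represented as `None`, and the opcode values are the ones listed on the OPCODE slide.

```python
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemRead", "MemWrite", "Branch", "ALUOp")

X = None  # don't care

CONTROL_TABLE = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, 0b10),  # R-format
    0b100011: (0, 1, 1, 1, 1, 0, 0, 0b00),  # lw
    0b101011: (X, 1, X, 0, 0, 1, 0, 0b00),  # sw
    0b000100: (X, 0, X, 0, 0, 0, 1, 0b01),  # beq
}


def control_signals(opcode):
    """Return the control line values for one of the four supported opcodes."""
    return dict(zip(SIGNALS, CONTROL_TABLE[opcode]))


def pc_src(branch, alu_zero):
    # PCSrc is the one control that also depends on the ALU result.
    return branch & alu_zero


lw = control_signals(0b100011)
print(lw["RegWrite"], lw["MemRead"], lw["ALUSrc"])
```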

SLIDE 71

[Figure: complete single-cycle datapath with the main Control unit. Instruction [31-26] feeds Control, which generates RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; Instruction [25-21], [20-16], and [15-11] feed the register file, Instruction [15-0] feeds Sign extend, and Instruction [5-0] feeds ALU control; PCSrc selects between PC+4 and the branch target.]

SLIDE 72

Single Cycle Instructions

- View the entire datapath as a combinational circuit.
- We can follow the flow of an instruction through the datapath.
  - single cycle instruction means that there are not really any steps; everything just happens and becomes finalized when the clock cycle is over.

SLIDE 73

add $t1,$t2,$t3

- Control Lines:
  - ALU Controls specify an ALU add operation.
  - RegDst will be a 1 so that the rd field selects the destination register.
  - RegWrite will be a 1 so that when the clock cycle ends the value on the Register Write Input lines will be written to a register.
  - all other control lines are 0.

SLIDE 74

lw $t1,offset($t2)

- Control Lines:
  - ALU Control set for an add operation.
  - ALUSrc is set to 1 to indicate the second operand is sign extended offset.
  - MemRead would be a 1.
  - RegDst would select the correct bits from the instruction to specify the dest. register.
  - RegWrite would be a 1.

SLIDE 75

Disadvantage of single cycle operation

If we have instructions execute in a single cycle, then the cycle time must be long enough for the slowest instruction.

- all instructions take the same time as the slowest.

SLIDE 76

Multicycle Implementation

- Chop up the processing of instructions into discrete stages.
- Each stage takes one clock cycle.
  - we can implement each stage as a big combinational circuit (like we just did for the whole thing).
  - provide some way to sequence through the stages.

SLIDE 77

Advantages of Multicycle

- Only need those stages required by an instruction.
  - the control unit is more complex, but instructions only take as long as necessary.
- We can share components
  - perhaps 2 different stages can use the same ALU.
  - We don’t need to duplicate resources.

SLIDE 78

Additional Resources for Multicycle

- To build a multicycle implementation we need some additional registers that can be used to hold intermediate values.
  - instruction
  - computed address
  - result of ALU operation
  - …

SLIDE 79

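Multicycle Datapath

[Figure: the PC supplies the Address to a single Memory holding instructions or data; the Memory output goes to the Instruction register and the Memory data register; the Registers block (three Register # inputs and a Data input) feeds registers A and B, which feed the ALU; the ALU result is held in ALUOut]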

SLIDE 80


Multicycle Datapath for MIPS

SLIDE 81


MC DP with Control

SLIDE 82

Instruction Stages

- Instruction Fetch
- Instruction decode/register fetch
- ALU operation/address computation
- Memory Access
- Register Write

SLIDE 83

Complete Multicycle Datapath & Control

SLIDE 84


Instruction Fetch/Decode (IF/ID) State Machine

SLIDE 85


Memory Reference State Machine

SLIDE 86


R-type Instruction State Machine

SLIDE 87


Branch/Jump State Machine

SLIDE 88


Put it all together!

SLIDE 89

Control for Multicycle

- Need to define the controls
- Need to come up with some way to sequence the controls
- Two techniques
  - finite state machine
  - microprogramming

SLIDE 90


Finite State Machine

SLIDE 91

MicroProgramming (sec. 5.7)

- The idea is to build a (very small) processor to generate the control signals at the right time.
- At each stage (cycle) one microinstruction is executed; the result changes the value of the control signals.
- Somebody writes the microinstructions that make up each MIPS instruction.

SLIDE 92

Example microinstructions

Fetch next instruction:

- turn on instruction memory read
- feed PC to memory address input
- write memory data output into a holding register.

Compute Address:

- route contents of base register to ALU
- route sign-extended offset to ALU
- perform ALU add
- write ALU output into a holding register.

[Slide callouts: Control Signals; microinstruction]

SLIDE 93

Sequencing

- In addition to setting some control signals, each microinstruction must specify the next microinstruction that should be executed.
- 3 Options:
  - execute next microinstruction (default)
  - start next MIPS instruction (Fetch)
  - Dispatch (depends on control unit inputs).

SLIDE 94

Microinstruction Format

- A bunch of bits, one for each control line needed by the control unit.
  - bits specify the values of the control lines directly.
- Some bits that are used to determine the next microinstruction executed.

SLIDE 95

Dispatch Sequencing

- Can be implemented as a table lookup.
  - bits in the microinstruction tell what row in the table.
  - inputs to the control unit tell what column.
  - value stored in table determines the microaddress of the next microinstruction.
- This is a simplified description (called a microdescription)

SLIDE 96

Exceptions & Interrupts

- Hardest part of control is implementing exceptions and interrupts, i.e., events that change the normal flow of instruction execution.
- MIPS convention
  - Exception refers to any unexpected change in control flow, without distinguishing whether the cause is internal or external.
  - Interrupt refers only to events that are externally caused.
  - Ex. Interrupts: I/O device request (ignore for now)
  - Ex. Exceptions: undefined instruction, arithmetic overflow

SLIDE 97

Handling Exceptions

- Let’s implement exception handling for
  - Undefined instruction
  - Overflow
- Basic actions
  - Save the offending instruction address in the Exception Program Counter (EPC).
  - Transfer control to the OS at some specified address
  - Once the exception is handled by the OS, either terminate the program or continue on, using the EPC to determine where to restart.
- OS actions are determined based on what caused the exception.
  - So, the OS needs a Cause register, which determines which path within the exception handler to take.
  - Alternative implementation: Vectored Interrupts, where each cause of an exception or interrupt is given a specific OS address to jump to.
  - We’ll use the first method.

SLIDE 98

Extending the Multicycle D&C

- What datapath elements to add?
  - EPC: a 32-bit register used to hold the address of the affected instruction.
  - Cause: a 32-bit register used to record the cause of the exception (undefined instruction = 0 and overflow = 1).
- What control lines to add?
  - EPCWrite and CauseWrite control signals to allow the registers to be written.
  - IntCause (1-bit) control signal to set the low-order bit of the Cause register to the appropriate value.
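A behavioral sketch of the added state, assuming the two cause codes listed above. The class name, method name, and the handler address used in the example are hypothetical; the slides only say that control transfers to "some specified address".

```python
UNDEFINED_INSTRUCTION = 0  # Cause value for an undefined instruction
OVERFLOW = 1               # Cause value for arithmetic overflow


class ExceptionState:
    """Holds the EPC and Cause registers added to the multicycle datapath."""

    def __init__(self):
        self.epc = 0
        self.cause = 0

    def raise_exception(self, instruction_address, int_cause, handler_address):
        self.epc = instruction_address & 0xFFFFFFFF  # EPCWrite asserted
        self.cause = int_cause & 1                   # CauseWrite, low bit from IntCause
        return handler_address                       # value loaded into the PC


state = ExceptionState()
# 0x80000000 is a placeholder handler address, not taken from the slides.
new_pc = state.raise_exception(0x00400024, OVERFLOW, handler_address=0x80000000)
print(hex(state.epc), state.cause, hex(new_pc))
```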

SLIDE 99


Revised Datapath & Control

SLIDE 100


Final FSM w/ exception handling

SLIDE 101

Pipelining

SLIDE 102

Multicycle Instructions

- Chop each instruction into stages.
- Each stage takes one cycle.
- We need to provide some way to sequence through the stages:
  - microinstructions
- Stages can share resources (ALU, Memory).

SLIDE 103

Pipelining

- We can overlap the execution of multiple instructions.
- At any time, there are multiple instructions being executed, each in a different stage.
- So much for sharing resources ?!?

SLIDE 104

The Laundry Analogy

Non-pipelined approach:

1. run 1 load of clothes through washer
2. run load through dryer
3. fold the clothes (optional step for students)
4. put the clothes away (also optional).

Two loads? Start all over.

SLIDE 105

Pipelined Laundry

- While the first load is drying, put the second load in the washing machine.
- When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.
- Admittedly unrealistic scenario for CS students, as most only own 1 load of clothes…

SLIDE 106

[Figure: two laundry timelines running from 6 PM to 2 AM, each showing loads A, B, C, and D in task order]

SLIDE 107

Laundry Performance

- For 4 loads:
  - non-pipelined approach takes 16 units of time.
  - pipelined approach takes 7 units of time.
- For 816 loads:
  - non-pipelined approach takes 3264 units of time.
  - pipelined approach takes 819 units of time.
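These numbers follow from a simple count, assuming four stages of one time unit each: without pipelining every load takes all four stages in turn, while with pipelining the first load takes four units and each later load finishes one unit after the previous one. The function names below are illustrative.

```python
def sequential_time(loads, stages=4):
    """Each load runs through every stage before the next load starts."""
    return loads * stages


def pipelined_time(loads, stages=4):
    """The first load fills the pipeline; each later load adds one time unit."""
    return stages + (loads - 1)


for n in (4, 816):
    print(n, sequential_time(n), pipelined_time(n))
```

As the number of loads grows, the ratio of the two times approaches the number of stages.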

SLIDE 108

Execution Time vs. Throughput

- It still takes the same amount of time to get your favorite pair of socks clean; pipelining won’t help.
- However, the total time spent away from CompOrg homework is reduced. It's the classic “Socks vs. CompOrg” issue.

SLIDE 109

CSCI-2500 SPRING 2016, Processor Design, Chapter 4

Instruction Pipelining

First we need to break instruction execution into discrete stages:

1. Instruction Fetch

2. Instruction Decode / Register Fetch

3. ALU Operation

4. Data Memory access

5. Write result into register

SLIDE 110


Operation Timings

n Some estimated timings for each of

the stages:

Instruction Fetch    200 ps

Register Read        100 ps

ALU Operation        200 ps

Data Memory          200 ps

Register Write       100 ps
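As a rough sketch of what these timings imply, the snippet below derives the clock period for a single-cycle design and for a five-stage pipeline. It assumes an instruction such as lw that uses all five stages, and that the pipelined clock is set by the slowest stage; the variable and function names are illustrative.

```python
# Stage latencies in picoseconds, copied from the slide.
STAGE_PS = {
    "Instruction Fetch": 200,
    "Register Read": 100,
    "ALU Operation": 200,
    "Data Memory": 200,
    "Register Write": 100,
}

# Single-cycle design: one clock must cover every stage of the
# slowest instruction (lw uses all five).
single_cycle_ps = sum(STAGE_PS.values())

# Pipelined design: each stage gets one clock, so the clock period
# is the latency of the slowest stage.
pipelined_cycle_ps = max(STAGE_PS.values())


def single_cycle_total(n_instructions: int) -> int:
    return n_instructions * single_cycle_ps


def pipelined_total(n_instructions: int) -> int:
    # Fill the pipeline once, then complete one instruction per clock.
    if n_instructions == 0:
        return 0
    return (len(STAGE_PS) + n_instructions - 1) * pipelined_cycle_ps


print(single_cycle_ps, pipelined_cycle_ps)        # 800 200
print(single_cycle_total(3), pipelined_total(3))  # 2400 1400
```

Note that the 100 ps register stages are stretched to 200 ps in the pipeline, so the speedup for long instruction streams approaches 800/200 = 4 rather than 5.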

SLIDE 111


Comparison

[Figure: program execution order (in instructions) versus time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg. Top: without pipelining, a new instruction starts every 800 ps. Bottom: with pipelining, a new instruction starts every 200 ps.]

SLIDE 112


RISC and Pipelining

n One of the major advantages of RISC

instruction sets is the relative simplicity of a pipeline implementation.

n It’s much more complex in a CISC processor!!

n RISC (MIPS) design features that make

pipelining easy include:

n single length instruction (always 1 word)

n relatively few instruction formats

n load/store instruction set

n operands must be aligned in memory (a single data transfer instruction requires a single memory operation).

SLIDE 113


Pipeline Hazard

n Something happens that means the next

instruction cannot execute in the following clock cycle.

n Three kinds of hazards:

n structural hazard

n control hazard

n data hazard

SLIDE 114


Structural Hazards

n Two stages require the same resource.

n What if we only had enough electricity to

run either the washer or the dryer at any given time?

n What if the MIPS datapath had only one memory unit instead of separate instruction and data memory?

SLIDE 115


Avoiding Structural Hazards

n Design the pipeline carefully.

n Might need to duplicate resources

n an Adder to update PC, and ALU to perform other operations.

n Detecting structural hazards at

execution time (and delaying execution) is not something we want to do (structural hazards are minimized in the design phase).

SLIDE 116


Control Hazards

n When one instruction needs to make a

decision based on the results of another instruction that has not yet finished.

n Example: conditional branch

n The instruction that is fed to the pipeline

right after a beq depends on whether or not the branch is taken.

SLIDE 117


beq Control Hazard

slt $t0,$s0,$s1

beq $t0,$zero,skip

addi $s0,$s0,1

skip: lw $s3,0($t3)

slt, beq, ???

The instruction to follow the beq could be either the addi or the lw; it depends on the result of the beq instruction.

a = b+c; if (x!=0) y++; ...

SLIDE 118


One possible solution - stall

n We can include in the control unit the

ability to stall (to keep new instructions from entering the pipeline until we know which one).

n Unfortunately conditional branches are

very common operations, and this would slow things down considerably.

SLIDE 119


A Stall

[Figure: program execution order (in instructions) versus time (2 to 16) for beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg. Interval labels: 4 ns, 2 ns, 2 ns.]

To achieve a 1 cycle stall (as shown above), we need to modify the implementation of the beq instruction so that the decision is made by the end of the second stage.

SLIDE 120


Another strategy

n Predict whether or not the branch

will be taken.

n Go ahead with the predicted

instruction (feed it into the pipeline next).

n If your prediction is right, you don't

lose any time.

n If your prediction is wrong, you need to undo some things and start the correct instruction.

SLIDE 121


Predicting branch not taken

[Figure: two timelines of program execution order (in instructions) versus time (2 to 14), each instruction passing through Instruction fetch, Reg, ALU, Data access, Reg. First timeline: beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), with 2 ns interval labels. Second timeline: beq $1, $2, 40; add $4, $5, $6; or $7, $8, $9, with a row of bubbles and interval labels 2 ns and 4 ns.]

SLIDE 122


Dynamic Branch Prediction

n The idea is to build hardware that will

come up with a prediction based on the past history of the specific branch instruction.

n Predict the branch will be taken if it has

been taken more often than not in the recent past.

n This works great for loops! (90% + correct).

n We’ll talk more about this …

SLIDE 123


Yet another strategy: delayed branch

n The compiler rearranges instructions so that the branch takes effect one instruction later than the point where its execution starts.

n This gives the hardware time to

compute the address of the next instruction.

n The new instruction is hopefully useful

whether or not the branch is taken (this is tricky - compilers must be careful!).

SLIDE 124


Delayed Branch

add $s2,$s3,$s4

beq $t0,$zero,skip

addi $s0,$s0,1

skip: lw $s3,0($t3)

beq, add: Order reversed! The compiler must generate code that differs from what you would expect.

a = b+c; if (x!=0) y++; ...

SLIDE 125


Data Hazard

n One of the values needed by an

instruction is not yet available (the instruction that computes it isn't done yet).

n This will cause a data hazard:

add $t0,$s1,$s2

addi $t0,$t0,17

SLIDE 126


[Figure: add $t0,$s1,$s2 followed by addi $t0,$t0,17 over time, each passing through IF, Reg, ALU, Data Access, Reg. Annotations on the add: selects $s1 and $s2 for ALU op; adds $s1 and $s2; stores sum in $t0. Annotation on the addi: selects $t0 for ALU op.]

SLIDE 127


Handling Data Hazards

n We can hope that the compiler can

arrange instructions so that data hazards never appear.

n This doesn't work, as programs generally need to use previously computed values for everything!

n Some data hazards aren't real - the

value needed is available, just not in the right place.

SLIDE 128


[Figure: add $t0,$s1,$s2 followed by addi $t0,$t0,17 over time, each passing through IF, Reg, ALU, Data Access, Reg. Annotation on the add: ALU has finished computing sum. Annotation on the addi: ALU needs sum from the previous ALU operation.]

The sum is available when needed!

SLIDE 129


Forwarding

n It's possible to forward the value

directly from one resource to another (in time).

n Hardware needs to detect (and handle)

these situations automatically!

n This is difficult, but necessary.

SLIDE 130


Picture of Forwarding

[Figure: program execution order (in instructions) versus time (2 to 10) for add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each passing through IF, ID, EX, MEM, WB.]

SLIDE 131


Another Example

[Figure: program execution order (in instructions) versus time (2 to 14) for lw $s0, 20($t1) followed by sub $t2, $s0, $t3, each passing through IF, ID, EX, MEM, WB, with a row of bubbles between the two instructions.]

SLIDE 132


Pipelining and CPI

n If we keep the pipeline full, one

instruction completes every cycle.

n Another way of saying this: the average

time per instruction is 1 cycle.

n even though each instruction actually takes

5 cycles (5 stage pipeline).

CPI=1

SLIDE 133


Correctness

Pipeline and compiler designers must be careful to ensure that the various schemes to avoid stalling do not change what the program does!

n only when and how it does it.

n It's impossible to test all possible combinations of instructions (to make sure the hardware does what is expected).

n It's impossible to test all combinations even

without pipelining!

SLIDE 134


Pipelined Datapath

We need to use a multicycle datapath.

n includes registers that store the result of

each stage (to pass on to the next stage).

n can't have a single resource used by more than one stage at a time.

SLIDE 135


[Figure: pipelined datapath with PC, instruction memory, adders, shift left 2, register file, sign extend, ALU, data memory and muxes, divided by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB.]

Pipelined Datapath – 5 stages

SLIDE 136


lw and pipelined datapath

n We can trace the execution of a load

word instruction through the datapath.

n We need to keep in mind that other

instructions are using the stages not in use by our lw instruction!

SLIDE 137


[Figure: pipelined datapath with lw in the instruction fetch stage.]

Stage 1: Instruction Fetch (IF)

SLIDE 138


[Figure: pipelined datapath with lw in the instruction decode stage.]

Stage 2: Instruction Decode (ID)

SLIDE 140


[Figure: pipelined datapath with lw in the execution stage.]

Stage 3: Execute (EX)

SLIDE 141


[Figure: pipelined datapath with lw in the memory stage.]

Stage 4: Memory Access (MEM)

SLIDE 142


[Figure: pipelined datapath with lw in the write back stage.]

Stage 5: WriteBack (WB)

SLIDE 143


A Bug!

n When the value read from memory is

written back to the register file, the inputs to the register file (write register #) are from a different instruction!

n To fix the bug we need to save part of the lw instruction (5 bits of it specify which register should get the value from memory).

SLIDE 144


New Datapath

[Figure: revised pipelined datapath with the same components and pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) and a corrected Write register path.]

Figure 4.41

SLIDE 145


Store Word (sw) Data Path Flow (EX)

SLIDE 146


SW Data Path (cont.)

SLIDE 147


Final Corrected Datapath

SLIDE 148


Ex. With 5 instructions

SLIDE 149


Ex: Alt View

SLIDE 150

Pipeline Control

SLIDE 151


Pipelined DP w/ signals

SLIDE 152


Control lines for pipeline stages

SLIDE 153


Pipelined DP w/ Control

SLIDE 154


Pipelined Dependencies

SLIDE 155


Pipeline w/ Forwarding Values

SLIDE 156


ALU & Regs: B4, After Fwding

SLIDE 157


Datapath w/ forwarding

SLIDE 158


Forwarding Control Table

MUX Control      Source    Reason

ForwardA = 00    ID/EX     1st ALU op from reg file

ForwardA = 10    EX/MEM    1st ALU op fwd from prior ALU result

ForwardA = 01    MEM/WB    1st ALU op fwd from data mem or earlier result

SLIDE 159


Forwarding Control Table (cont.)

MUX Control      Source    Reason

ForwardB = 00    ID/EX     2nd ALU op from reg file

ForwardB = 10    EX/MEM    2nd ALU op fwd from prior ALU result

ForwardB = 01    MEM/WB    2nd ALU op fwd from data mem or earlier result

SLIDE 160


Resolution

n if( EX/MEM.RegWrite &&

EX/MEM.RegisterRd != 0 && EX/MEM.RegisterRd == ID/EX.RegisterRs )

ForwardA = 10

n if( EX/MEM.RegWrite &&

EX/MEM.RegisterRd != 0 && EX/MEM.RegisterRd == ID/EX.RegisterRt )

ForwardB = 10

SLIDE 161


Mem Stage Hazard Detection & Resolution

n if( MEM/WB.RegWrite &&

MEM/WB.RegisterRd != 0 && EX/MEM.RegisterRd != ID/EX.RegisterRs && MEM/WB.RegisterRd == ID/EX.RegisterRs)

ForwardA = 01

n if( MEM/WB.RegWrite &&

MEM/WB.RegisterRd != 0 && EX/MEM.RegisterRd != ID/EX.RegisterRt && MEM/WB.RegisterRd == ID/EX.RegisterRt)

ForwardB = 01
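The EX-stage and MEM-stage conditions can be combined into one selection function per ALU operand. The sketch below is an illustration, not the course's implementation: `PipeReg` and `forward_select` are hypothetical names, and the MEM-stage case is guarded by "no EX hazard applies" (checking the EX case first), which is slightly stricter than the register-number comparison written on the slide. Checking EX/MEM first gives priority to the most recent result.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """The two fields of EX/MEM or MEM/WB that the forwarding unit reads."""
    reg_write: bool  # RegWrite control bit
    rd: int          # destination register number (RegisterRd)


def forward_select(src: int, ex_mem: PipeReg, mem_wb: PipeReg) -> str:
    """Return the 2-bit mux control for one ALU operand.

    `src` is ID/EX.RegisterRs for ForwardA or ID/EX.RegisterRt for ForwardB.
    """
    # EX hazard: the instruction one ahead writes the register we read.
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "10"
    # MEM hazard: the instruction two ahead writes it, and no EX hazard
    # supplied a newer value.
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "01"
    # No hazard: use the value read from the register file.
    return "00"


# add $s0,$t0,$t1 is in EX/MEM while sub $t2,$s0,$t3 is in ID/EX ($s0 = 16).
print(forward_select(16, PipeReg(True, 16), PipeReg(False, 0)))  # 10
```

Register $zero (number 0) is excluded in both cases because writes to it are discarded, so there is never a new value to forward.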

SLIDE 162


Data Hazards & Stalls

n Need Hazard detection unit in addition

to forwarding unit.

n Check for Load Instructions based on…

n if( ID/EX.MemRead &&

(ID/EX.RegisterRt==IF/ID.RegisterRs || ID/EX.RegisterRt==IF/ID.RegisterRt)) StallThePipeline
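A minimal sketch of the load-use check above, written as a plain function. The name `must_stall` and its parameter names are illustrative; register numbers follow the standard MIPS convention ($t0 to $t7 are 8 to 15, $s0 to $s7 are 16 to 23).

```python
def must_stall(id_ex_mem_read: bool, id_ex_rt: int,
               if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID needs a register that the load
    currently in EX has not fetched from memory yet."""
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)


# lw $s0, 20($t1) is in EX (its destination is rt = 16);
# sub $t2, $s0, $t3 is in ID (rs = 16, rt = 11).
print(must_stall(True, 16, 16, 11))   # True: one bubble is required

# Same register numbers, but the instruction in EX is not a load.
print(must_stall(False, 16, 16, 11))  # False: forwarding alone suffices
```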

SLIDE 163


Where Forwarding Fails…must stall

SLIDE 164


How Stalls Are Inserted

SLIDE 165


Pipelined control w/ fwding & hazard detection

SLIDE 166


What about those crazy branches?

Problem: if the branch is taken, the PC goes to addr 72, but we don’t know that until after 3 other instructions are processed.

SLIDE 167


Branch Hazards: Assume Branch Not Taken

n Recall stalling until branch is complete is too

ssssssllllooooowwww!!

n So, assume the branch is not taken…

n If taken, instructions fetched/decoded must be discarded or “squashed”

n To discard instructions, just change the original control values to 0’s (similar to load-use hazard).

n BIG DIFFERENCE: must flush the pipeline in the IF, ID and EX stages

n How can we reduce the “flush” costs when a branch is

taken?

SLIDE 168


Reducing the Delay of Branches

n Let’s move the branch execution earlier

in the pipeline.

n EFFECT: fewer instructions need to be

flushed.

n NEED two actions:

n Compute branch target address (EASY – can do in the ID stage, using the values in the IF/ID register).

n Eval of branch decision (HARD)

SLIDE 169


Faster Branch Decision

n Recall, for BEQ instruction, we would

compare two regs during the ID stage and test for equality.

n Equality can be tested by XORing the two regs and checking that every bit of the result is 0. (a.k.a. equality unit)

n Need additional ID stage forwarding

and hazard detection hardware

n This has 2 complicating factors…
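The XOR-based equality test mentioned above can be sketched in a few lines, assuming 32-bit registers. `regs_equal` is an illustrative name; in hardware the final "is it zero" check would be a wide OR/NOR over the XOR outputs.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits, as a register would


def regs_equal(a: int, b: int) -> bool:
    """XOR produces a 1 wherever the two inputs differ, so the registers
    are equal exactly when the XOR result is all zeros."""
    return ((a ^ b) & MASK32) == 0


print(regs_equal(0x1234, 0x1234))  # True
print(regs_equal(0x1234, 0x1235))  # False
```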

SLIDE 170


Faster Branch Decision: Complex Factors

1. In ID stage, now we need to decide whether a “bypass” path to the “equality” unit is needed.

ALU forwarding logic is not sufficient, and so we need new forwarding logic for the equality unit.

2. Can stall due to a data hazard.

If an R-type instruction immediately before the branch produces an operand used in the branch comparison, a stall is needed.

SLIDE 171


Example Pipelined Branch

36 sub $10, $4, $8

40 beq $1, $3, 7

44 and $12, $2, $5

48 or $13, $2, $6

52 and $14, $4, $2

56 slt $15, $6, $7

……..

72 lw $4, 50($7)
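The target address 72 in this listing follows from the standard MIPS branch rule: the 16-bit offset counts words relative to the instruction after the branch. A small sketch, with `branch_target` as an illustrative name:

```python
def branch_target(branch_pc: int, offset_words: int) -> int:
    """MIPS conditional branch target: (PC + 4) + (offset << 2)."""
    return branch_pc + 4 + (offset_words << 2)


# beq $1, $3, 7 sits at address 40.
print(branch_target(40, 7))  # 72, the address of lw $4, 50($7)
```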

SLIDE 172


Branch Processing Example

SLIDE 173


Dynamic Branch Prediction

n From the phrase “There is no such thing as a typical program”, it follows that programs will branch in different ways, and so there is no “one size fits all” branch algorithm.

n Alt approach: keep a history (1 bit) on each branch instruction and see if it was last taken or not.

n Implementation: branch prediction buffer or branch history table.

n Index based on lower part of branch address

n Single bit indicates if branch at address was last taken or not. (1 or 0)

SLIDE 174


Problem with 1-bit Branch Predictors

n Consider a loop branch

n Suppose it is taken 9 times in a row, then is not taken.

n What’s the branch prediction accuracy?

n ANSWER: 1-bit predictor will mispredict the entry and exit points of the loop.

n Yields only an 80% accuracy when there is potential for 90% (i.e., you have to guess wrong on the exit of the loop).
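The 80% figure can be checked with a tiny simulation. This is a sketch under the assumption that the loop is executed repeatedly, so the predictor bit carries over from the previous exit; `one_bit_accuracy` is an illustrative name.

```python
def one_bit_accuracy(outcomes, last_taken=False):
    """Predict that each branch does whatever it did last time."""
    correct = 0
    for taken in outcomes:
        if last_taken == taken:
            correct += 1
        last_taken = taken  # the single history bit
    return correct / len(outcomes)


# One pass through the loop: branch taken 9 times, then not taken.
one_pass = [True] * 9 + [False]

# Run the loop 100 times back to back.
print(one_bit_accuracy(one_pass * 100))  # 0.8
```

Each pass loses two predictions: the first iteration (the bit still says "not taken" from the previous exit) and the exit itself.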

SLIDE 175


Solution: 2-bit Branch Predictor

Must be wrong twice before changing prediction

Learns if the branch is more biased towards “taken” or “not taken”
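One common way to realize "wrong twice before changing" is a 2-bit saturating counter. The sketch below assumes that scheme; the slide's state diagram is not reproduced in the text, so its exact transitions may differ. `two_bit_accuracy` is an illustrative name.

```python
def two_bit_accuracy(outcomes, state=3):
    """2-bit saturating counter: states 0 and 1 predict not taken,
    states 2 and 3 predict taken. Default start is 3 (strongly taken)."""
    correct = 0
    for taken in outcomes:
        predict_taken = state >= 2
        if predict_taken == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


# Same loop branch as before: taken 9 times, then not taken, repeated.
one_pass = [True] * 9 + [False]
print(two_bit_accuracy(one_pass * 100))  # 0.9
```

The loop exit moves the counter from 3 to 2, which still predicts "taken", so the first iteration of the next pass is predicted correctly and only the exit is missed.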

SLIDE 176


Performance: Single vs Multicycle vs. PL

n Assume: 200 ps for memory access, 100 ps for

ALU ops, 50 ps for register access

n Single-cycle clock cycle:

n 600 ps: 200 + 50 + 100 + 200 + 50

n Further assume instruction mix

n 25% loads, 10% stores, 11% branches, 2% jumps, 52% ALU instructions

n Assume CPI for multi-cycle is 3.50

n Multicycle clock cycle: must be longest unit which is 200 ps

n Total time for an “avg” instruction is 3.5 * 200 ps = 700 ps

SLIDE 177


Pipeline performance (cont)

n For pipelined design…

n Loads take 1 cycle when no load-use dependence and 2 cycles when there is, yielding an average of 1.5 cycles per load.

n Stores and ALU instructions take 1 cycle.

n Branches take 1 cycle when predicted correctly and 2 cycles when not. Assume 75% accuracy, average branch cycles is 1.25.

n Jumps are 2 cycles.

n Avg CPI then is:

1.5 x 25% + 1 x 10% + 1 x 52% + 1.25 x 11% + 2 x 2% = 1.17

n Longest stage is 200 ps, so 200 x 1.17 = 234 ps
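The three designs can be compared with a short calculation using only the numbers given on these two slides. This is a sketch; the dictionary and variable names are illustrative. Note that the unrounded pipelined CPI is 1.1725, so the unrounded average time is 234.5 ps; the slide's 234 ps comes from rounding the CPI to 1.17 first.

```python
# Instruction mix from the slide (fractions sum to 1).
MIX = {"load": 0.25, "store": 0.10, "alu": 0.52, "branch": 0.11, "jump": 0.02}

# Average pipelined cycles per instruction class, as stated on the slide.
branch_cycles = 0.75 * 1 + 0.25 * 2  # 75% predicted correctly
CYCLES = {"load": 1.5, "store": 1.0, "alu": 1.0,
          "branch": branch_cycles, "jump": 2.0}

pipelined_cpi = sum(MIX[kind] * CYCLES[kind] for kind in MIX)

single_cycle_ps = 200 + 50 + 100 + 200 + 50  # one long clock per instruction
multicycle_ps = 3.5 * 200                    # assumed CPI x 200 ps clock
pipelined_ps = pipelined_cpi * 200           # CPI x 200 ps clock

print(round(pipelined_cpi, 4))   # 1.1725
print(single_cycle_ps, multicycle_ps, round(pipelined_ps, 1))
```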

SLIDE 178


Even more performance…

n Ultimately we want greater and greater

Instruction Level Parallelism (ILP)

n How?

n Multiple instruction issue.

n Results in CPIs less than one.

n Here, instructions are grouped into “issue slots”.

n So, we usually talk about IPC (instructions per cycle)

n Static: uses the compiler to assist with grouping instructions and hazard resolution. Compiler MUST remove ALL hazards.

n Dynamic: (i.e., superscalar) hardware creates the instruction schedule based on dynamically detected hazards

SLIDE 179


Example Static 2-issue Datapath

Additions include:

32 bits from instr. mem

Two read, 1 write ports on reg file

1 more ALU (top handles address calc)

SLIDE 180


Ex. 2-Issue Code Schedule

Loop: lw   $t0, 0($s1)        # t0=array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # dec pointer
      bne  $s1, $zero, Loop   # branch $s1!=0

        ALU/Branch              Data Xfer Inst.     Cycle
Loop:                           lw $t0, 0($s1)      1
        addi $s1, $s1, -4                           2
        addu $t0, $t0, $s2                          3
        bne $s1, $zero, Loop    sw $t0, 4($s1)      4

It takes 4 clock cycles for 5 instructions, or an IPC of 1.25

SLIDE 181


More Performance: Loop Unrolling

n Technique where multiple copies of the loop body are made.

n Make more ILP available by removing dependencies.

n How? Compiler introduces additional registers via “register renaming”.

n This removes “name” or “anti” dependence

n where an instruction order is purely a consequence of the reuse of a register and not a real data dependence.

n Ex. lw $t0, 0($s1), addu $t0, $t0, $s2 and sw $t0, 4($s1)

n No data values flow between one pair and the next pair

n Let’s assume we unroll a block of 4 iterations of the loop.
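To make the idea concrete, here is a source-level sketch of the same transformation. It is an illustration in Python, not compiler output: the function names are hypothetical, and the temporaries `t0` to `t3` stand in for the renamed registers. Because each temporary is used by only one element, the four add/store groups have no dependence on each other and could be scheduled in any order.

```python
def add_scalar_rolled(a, s):
    """One element per iteration, walking down the array like the MIPS loop."""
    i = len(a) - 1
    while i >= 0:
        a[i] = a[i] + s
        i -= 1


def add_scalar_unrolled4(a, s):
    """Four elements per iteration, each with its own temporary."""
    if len(a) % 4 != 0:
        raise ValueError("this sketch assumes the length is a multiple of 4")
    i = len(a) - 4
    while i >= 0:
        t0 = a[i + 3] + s
        t1 = a[i + 2] + s
        t2 = a[i + 1] + s
        t3 = a[i] + s
        a[i + 3], a[i + 2], a[i + 1], a[i] = t0, t1, t2, t3
        i -= 4  # one pointer update and one branch per four elements


x = list(range(8))
y = list(range(8))
add_scalar_rolled(x, 10)
add_scalar_unrolled4(y, 10)
print(x == y)  # True
```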

SLIDE 182


Loop Unrolling Schedule

        ALU/Branch Instructions   Data Xfer          Cycle
Loop:   addi $s1, $s1, -16        lw $t0, 0($s1)     1
                                  lw $t1, 12($s1)    2
        addu $t0, $t0, $s2        lw $t2, 8($s1)     3
        addu $t1, $t1, $s2        lw $t3, 4($s1)     4
        addu $t2, $t2, $s2        sw $t0, 16($s1)    5
        addu $t3, $t3, $s2        sw $t1, 12($s1)    6
                                  sw $t2, 8($s1)     7
        bne $s1, $zero, Loop      sw $t3, 4($s1)     8

SLIDE 183


Performance of Instruction Schedule

n 12 of 14 instructions execute in a pair.

n Takes 8 clock cycles for 4 loop iterations

n Yields 2 clock cycles per iteration

n CPI = 8/14 ≈ 0.57

n Cost of improvement: 4 temp regs + lots of additional code
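The figures for the two schedules can be compared side by side. A minimal sketch using only the instruction and cycle counts from the slides (5 instructions in 4 cycles for the original 2-issue schedule, 14 instructions in 8 cycles after unrolling four iterations); the helper names are illustrative.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions completed per clock cycle."""
    return instructions / cycles


def cpi(instructions: int, cycles: int) -> float:
    """Clock cycles per instruction, the reciprocal of IPC."""
    return cycles / instructions


# Original 2-issue schedule: one loop iteration.
print(ipc(5, 4), round(cpi(5, 4), 2))             # 1.25 0.8

# Unrolled schedule: four loop iterations.
print(ipc(14, 8), round(cpi(14, 8), 2))           # 1.75 0.57

# Cycles per loop iteration before and after unrolling.
print(4 / 1, 8 / 4)                               # 4.0 2.0
```

Unrolling halves the cycles per iteration (4 to 2) even though the machine still issues at most two instructions per cycle, because more of the issue slots are filled.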

SLIDE 184


Dynamic Scheduled Pipeline

SLIDE 185


Intel P4 Dynamic Pipeline

SLIDE 186
