Winter 2006 CSE 548 - Dataflow Machines

Von Neumann Execution Model

Fetch:

  • send PC to memory
  • transfer instruction from memory to CPU
  • increment PC

Decode & read ALU input sources

Execute:

  • an ALU operation
  • memory operation
  • branch target calculation

Store the result in a register

  • from the ALU or memory
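The fetch/decode/execute/store steps above can be sketched as a loop. This is a minimal illustration assuming a tiny invented ISA (add / load / beqz / halt), not any real machine's instruction set.

```python
# Minimal sketch of the von Neumann fetch-decode-execute cycle.
# The four-field instruction format and opcodes are invented for
# illustration only.

def run(memory, registers):
    pc = 0
    while True:
        # Fetch: send PC to memory, transfer the instruction, increment PC
        op, a, b, c = memory[pc]
        pc += 1
        # Decode & execute, then store the result in a register
        if op == "add":            # ALU operation
            registers[a] = registers[b] + registers[c]
        elif op == "load":         # memory operation
            registers[a] = memory[registers[b] + c]
        elif op == "beqz":         # branch target calculation
            if registers[a] == 0:
                pc = b
        elif op == "halt":
            return registers

regs = run([("add", 1, 2, 3), ("halt", 0, 0, 0)], {1: 0, 2: 5, 3: 7})
print(regs[1])   # -> 12
```

Note that the PC is the only thing that sequences execution; the dataflow model that follows removes it.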

Von Neumann Execution Model

Program is a linear series of addressable instructions

  • send PC to memory
  • next instruction to execute depends on what happened during the execution of the current instruction
  • next instruction to be executed is pointed to by the PC

Operands reside in a centralized, global memory (GPRs)


Dataflow Execution Model

Instructions are already in the processor:

  • operands arrive from a producer instruction
  • check to see if all of an instruction’s operands are there

Execute:

  • an ALU operation
  • memory operation
  • branch target calculation

Send the result

  • to the consumer instructions or memory

Dataflow Execution Model

Execution is driven by the availability of input operands

  • operands are consumed
  • output is generated
  • no PC

Result operands are passed directly to consumer instructions

  • no register file

Dataflow Computers

Motivation:

  • exploit instruction-level parallelism on a massive scale
  • more fully utilize all processing elements

Believed this was possible if:

  • expose instruction-level parallelism by using a functional-style programming language
  • no side effects; the only ordering restrictions were producer-consumer relationships
  • schedule code for execution on the hardware greedily
  • hardware support for data-driven execution

Instruction-Level Parallelism (ILP)

Fine-grained parallelism

Obtained by:

  – instruction overlap (later, as in a pipeline)
  – executing instructions in parallel (later, with multiple instruction issue)

In contrast to:

  – loop-level parallelism (medium-grained)
  – process-level, task-level, or thread-level parallelism (coarse-grained)


Instruction-Level Parallelism (ILP)

Can be exploited when instruction operands are independent of each other, for example:

  – two instructions are independent if their operands are different
  – an example of independent instructions:

      ld R1, 0(R2)
      or R7, R3, R8

Each thread (program) has a fair amount of potential ILP

  – very little can be exploited on today’s computers
  – researchers trying to increase it

Dependences

data dependence: arises from the flow of values through programs

  – consumer instruction gets a value from a producer instruction
  – determines the order in which instructions can be executed

      ld  R1, 32(R3)
      add R3, R1, R8

name dependence: instructions use the same register but no flow of data between them

  – antidependence:

      ld  R1, 32(R3)
      add R3, R1, R8

  – output dependence:

      ld  R1, 32(R3)
      add R3, R1, R8
      ld  R1, 16(R3)
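The dependence kinds above can be detected mechanically by comparing the register sets each instruction reads and writes. A minimal sketch (the read/write sets are supplied by hand here, not parsed from assembly):

```python
# Classify dependences between two instructions, each given as a
# (registers_written, registers_read) pair, earlier instruction first.
# A sketch only; a real compiler would also track memory locations.

def dependences(first, second):
    w1, r1 = first
    w2, r2 = second
    kinds = []
    if w1 & r2:            # second reads a value the first produces
        kinds.append("data (RAW)")
    if r1 & w2:            # second overwrites a register the first reads
        kinds.append("anti (WAR)")
    if w1 & w2:            # both write the same register
        kinds.append("output (WAW)")
    return kinds

# ld  R1, 32(R3)  -> writes {R1}, reads {R3}
# add R3, R1, R8  -> writes {R3}, reads {R1, R8}
print(dependences(({"R1"}, {"R3"}), ({"R3"}, {"R1", "R8"})))
# -> ['data (RAW)', 'anti (WAR)']
```

The `ld`/`add` pair from the slide exhibits both a data dependence (R1 flows from `ld` to `add`) and an antidependence (`add` overwrites R3, which `ld` reads).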


Dependences

control dependence

  • arises from the flow of control
  • instructions after a branch depend on the value of the branch’s condition variable

Dependences inhibit ILP

      beqz R2, target
      lw   r1, 0(r3)
  target:
      add  r1, ...


Dataflow Execution

All computation is data-driven.

  • binary represented as a directed graph
  • nodes are operations
  • values travel on arcs
  • WaveScalar instruction: opcode, destination1, destination2

[Figure: a “+” node with input arcs a and b and output arc a+b]


Dataflow Execution

Data-dependent operations are connected, producer to consumer

Code & initial values loaded into memory

Execute according to the dataflow firing rule:

  • when operands of an instruction have arrived on all input arcs, the instruction may execute
  • values on the input arcs are removed
  • computed value placed on output arc

[Figure: a “+” node consuming input tokens a and b and producing a+b]
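The firing rule can be sketched as a small interpreter: a node fires once values have arrived on all of its input arcs, consumes them, and forwards the result to its consumers. The graph encoding (node names, `(op, input count, consumer list)` tuples) is invented for illustration.

```python
# Minimal dataflow interpreter sketch. Each node fires at most once,
# when all of its input slots hold a value; firing consumes the
# inputs (tokens removed) and sends the result to each consumer.

def run_dataflow(nodes, initial):
    # nodes:   name -> (op, n_inputs, [(consumer_name, input_slot), ...])
    # initial: (name, slot) -> value   (tokens on the initial arcs)
    inbox = {name: {} for name in nodes}
    for (name, slot), value in initial.items():
        inbox[name][slot] = value
    results = {}
    fired = True
    while fired:
        fired = False
        for name, (op, n_in, consumers) in nodes.items():
            if name not in results and len(inbox[name]) == n_in:
                args = [inbox[name][i] for i in range(n_in)]
                inbox[name] = {}                  # input tokens removed
                value = op(*args)
                results[name] = value
                for consumer, slot in consumers:  # output token placed
                    inbox[consumer][slot] = value
                fired = True
    return results

# Graph for (a + b) * a, with a = 3, b = 4
nodes = {
    "add": (lambda x, y: x + y, 2, [("mul", 0)]),
    "mul": (lambda x, y: x * y, 2, []),
}
initial = {("add", 0): 3, ("add", 1): 4, ("mul", 1): 3}
print(run_dataflow(nodes, initial))   # -> {'add': 7, 'mul': 21}
```

Note there is no PC anywhere: the order of execution falls out of token availability alone.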


Dataflow Example

    A[j + i*i] = i;
    b = A[i*j];

[Figure: dataflow graph with inputs i, j, and A; “*” and “+” nodes compute the two addresses, feeding a Store for A[j + i*i] = i and a Load that produces b]


Dataflow Execution

Control: split (steer) & merge (φ)

  • convert control dependence to data dependence with value-steering instructions
  • execute one path after the condition variable is known (split), or
  • execute both paths & pass values at the end (merge)

[Figure: steer and merge nodes, each taking a predicate plus T-path and F-path values]
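The two steering primitives can be sketched as token-level functions (the names and return conventions are invented here): steer forwards its value onto only one of two output arcs, while φ selects between two already-computed values.

```python
# Sketch of the two control primitives as token-level functions.
# steer: forwards the value to the T or F output arc, chosen by the
#        predicate; the untaken arc carries no token (None here).
# phi:   both paths have executed; select one of the two results.

def steer(predicate, value):
    return (value, None) if predicate else (None, value)

def phi(predicate, t_value, f_value):
    return t_value if predicate else f_value

print(steer(True, 42))    # -> (42, None)
print(phi(False, 1, 2))   # -> 2
```

steer trades latency (wait for the predicate) for work; φ trades work (execute both paths) for latency.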


WaveScalar Control

[Figure: WaveScalar steer and φ instructions]


Dataflow Computer ISA

Instructions

  • operation
  • destination instructions

Data packets, called tokens

  • value
  • tag to identify the operand instance & match it with its fellow operands in the same dynamic instruction instance
  • tag contents are architecture dependent:
      – instruction number
      – iteration number
      – activation/context number (for functions, especially recursive)
      – thread number

A dataflow computer executes a program by receiving, matching & sending out tokens.
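Token matching can be sketched as a store keyed by (destination instruction, tag): when a token arrives whose partner is already waiting under the same key, the pair is released for execution. The tag fields and interface below are invented for illustration.

```python
# Sketch of a token store for two-input instructions. Tokens are
# matched on (instruction, tag), where the tag identifies the dynamic
# instance (e.g. the iteration). When both operands have arrived, the
# matched pair is released so the instruction can fire.

def make_token_store():
    waiting = {}

    def arrive(instr, tag, port, value):
        key = (instr, tag)
        if key in waiting:
            other_port, other_value = waiting.pop(key)
            operands = {port: value, other_port: other_value}
            return (instr, tag, operands[0], operands[1])   # ready to fire
        waiting[key] = (port, value)                        # wait for partner
        return None

    return arrive

arrive = make_token_store()
print(arrive("add", ("iter", 0), 0, 3))   # -> None (first operand waits)
print(arrive("add", ("iter", 0), 1, 4))   # -> ('add', ('iter', 0), 3, 4)
```

The dictionary lookup here stands in for the associative search over the hardware token store; the "large token store" scalability problem discussed later is exactly the cost of this structure.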


Types of Dataflow Computers

static:

  • one copy of each instruction
  • no simultaneously active iterations, no recursion

dynamic:

  • multiple copies of each instruction
  • better performance
  • gate counting technique to prevent instruction explosion: k-bounding
      – extra instruction with k tokens on its input arc; passes a token to the 1st instruction of the loop body
      – 1st instruction of the loop body consumes a token (needs one extra operand to execute)
      – last instruction in the loop body produces another token at the end of the iteration
      – limits active iterations to k
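k-bounding behaves like a counting semaphore on loop entry. A minimal sketch (class and method names invented here):

```python
# Sketch of k-bounding: a pool starts with k tokens; the first
# instruction of the loop body must consume one before an iteration
# can start, and the last instruction returns it at the end of the
# iteration. At most k iterations are ever in flight.

class KBound:
    def __init__(self, k):
        self.tokens = k          # k tokens initially on the input arc

    def enter_iteration(self):
        if self.tokens == 0:
            return False         # iteration must wait for a token
        self.tokens -= 1
        return True

    def finish_iteration(self):
        self.tokens += 1         # last instruction produces a token

kb = KBound(2)
print(kb.enter_iteration())   # -> True
print(kb.enter_iteration())   # -> True
print(kb.enter_iteration())   # -> False (only 2 active iterations allowed)
kb.finish_iteration()
print(kb.enter_iteration())   # -> True
```
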

Prototypical Early Dataflow Computer

Original implementations were centralized. Performance costs:

  • large token store (long access)
  • long wires
  • arbitration for PEs and return of results

[Figure: instruction packets flow from the token store & instruction memory to the processing elements; result data packets (tokens) circulate back to the token store]


Problems with Dataflow Computers

Language compatibility

  • dataflow cannot guarantee a global ordering of memory operations
  • dataflow computer programmers could not use mainstream programming languages, such as C
  • developed special languages in which order didn’t matter

Scalability: large token store

  • side-effect-free programming language with no mutable data structures
  • each update creates a new data structure
  • 1000 tokens for 1000 data items, even if they hold the same value
  • associative search impossible; accessed with a slower hash function
  • aggravated by the state of processor technology at the time

More minor issues

  • PE stalled waiting for operand arrival
  • lack of operand locality

Partial Solutions

Data representation in memory

  • I-structures:
      – write once; read many times
      – early reads are deferred until the write
  • M-structures:
      – multiple reads & writes, but they must alternate
      – reusable structures which can hold multiple values

Local (register) storage for back-to-back instructions in a single thread

Cycle-level multithreading
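An I-structure cell can be sketched as write-once storage that queues early reads until the write arrives (the interface below is invented for illustration; real I-structures attach this behavior to each element of an array):

```python
# Sketch of one I-structure cell: write-once, read-many storage.
# Reads that arrive before the write are deferred and answered when
# the write happens; a second write is an error.

class ICell:
    EMPTY = object()

    def __init__(self):
        self.value = ICell.EMPTY
        self.deferred = []          # consumers waiting for the write

    def read(self, consumer):
        if self.value is ICell.EMPTY:
            self.deferred.append(consumer)   # early read: defer it
        else:
            consumer(self.value)             # late read: answer now

    def write(self, value):
        assert self.value is ICell.EMPTY, "I-structure is write-once"
        self.value = value
        for consumer in self.deferred:       # release deferred reads
            consumer(value)
        self.deferred = []

cell = ICell()
seen = []
cell.read(seen.append)   # arrives before the write: deferred
cell.write(7)            # write releases the deferred read
cell.read(seen.append)   # arrives after the write: answered immediately
print(seen)              # -> [7, 7]
```

Because every read eventually sees the single written value, producer-consumer ordering is preserved without a global ordering of memory operations.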


Partial Solutions

Frames of sequential instruction execution

  • create “frames”, each of which stores the data for one iteration or one thread
  • no need to search the entire token store (an offset into the frame suffices)
  • dataflow execution among coarse-grain threads

Partition the token store & place each partition with a PE

Many solutions led away from pure dataflow execution