WaveScalar
CSE 548, Winter 2006

SLIDE 1
WaveScalar

Dataflow machine

  • good at exploiting ILP
  • dataflow parallelism + traditional coarser-grain parallelism
  • cheap thread management
  • memory ordering enforced through wave-ordered memory

SLIDE 2

WaveScalar

Motivation:

  • increasing disparity between computation (fast transistors) & communication (long wires)

  • increasing circuit complexity
  • decreasing fabrication reliability

SLIDE 3

Monolithic von Neumann Processors

A phenomenal success today. But in 2016?

  • Performance: centralized processing & control, e.g., operand broadcast networks
  • Complexity: 40-75% of “design” time is design verification
  • Defect tolerance: 1 flaw -> paperweight

SLIDE 4

WaveScalar Executive Summary

Distributed microarchitecture 

  • hundreds of PEs
  • dataflow execution – no centralized control
  • short point-to-point communication
  • organized hierarchically for fast communication between neighboring PEs

  • defect tolerance – route around a bad PE

Low design complexity through simple, identical PEs 

  • design one & stamp out thousands

SLIDE 5

Processing Element

  • distributed tag matching
  • 2 PEs in a pod

SLIDE 6

Domain

SLIDE 7

Cluster

SLIDE 8

Whole Chip

  • Can hold 32K instructions
  • Long distance communication
  • Dynamic routing
  • Grid-based network
  • 2-cycle hop/cluster (see the latency sketch below)
  • Normal memory hierarchy
  • Traditional directory-based cache coherence
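
As a back-of-the-envelope example (the 2-cycle hop figure is from the slide; the grid coordinates and dimension-ordered routing are assumptions for illustration), inter-cluster latency grows with Manhattan distance:

    # Rough inter-cluster latency on the grid network, assuming the
    # 2 cycles per cluster hop quoted above and dimension-ordered
    # routing; the 4x4 layout is only an example.
    def grid_latency(src, dst, cycles_per_hop=2):
        (x0, y0), (x1, y1) = src, dst
        return cycles_per_hop * (abs(x1 - x0) + abs(y1 - y0))

    print(grid_latency((0, 0), (3, 3)))   # corner to corner on a 4x4 grid: 12 cycles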

SLIDE 9

WaveScalar Execution Model

Dataflow: place instructions in PEs to maximize data locality & instruction-level parallelism.

  • instruction placement algorithm based on a performance model that captures the conflicting goals
  • depth-first traversal of the dataflow graph to make chains of dependent instructions (see the sketch below)
  • chains are broken into segments
  • segments snake across the chip on demand
  • K-loop bounding to prevent instruction “explosion”

Instructions communicate values directly (point-to-point).
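
The chain-building step might look like the following sketch; the graph encoding, traversal details, and segment size are illustrative assumptions, not the published placement algorithm.

    # Hypothetical sketch of the depth-first chain-building step: follow
    # one dependent as deep as possible so producer/consumer pairs land
    # next to each other, then cut chains into segments that map onto
    # neighboring PEs. All names and the segment size are made up.

    def build_chains(graph, roots):
        """graph: instruction -> list of dependent instructions."""
        visited, chains, work = set(), [], list(roots)
        while work:
            node = work.pop()
            if node in visited:
                continue
            chain = []
            while node is not None and node not in visited:
                visited.add(node)
                chain.append(node)
                succs = [s for s in graph.get(node, []) if s not in visited]
                node = succs[0] if succs else None   # stay on one dependence path
                work.extend(succs[1:])               # siblings seed new chains
            chains.append(chain)
        return chains

    def segments(chain, seg_size=8):
        """Cut a chain into placement segments, one per PE neighborhood."""
        return [chain[i:i + seg_size] for i in range(0, len(chain), seg_size)]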

SLIDE 10

WaveScalar Instruction Placement

SLIDE 11

WaveScalar Example

A[j + i*i] = i;
b = A[i*j];

[Dataflow graph: multiplies compute i*i and i*j, adds form j + i*i and the two addresses into A, a Store writes i, and a Load reads b.]

SLIDE 16

WaveScalar Example

A[j + i*i] = i;
b = A[i*j];

Global load-store ordering issue: pure dataflow gives no order between the Store and the Load, yet when j + i*i == i*j they touch the same element of A.
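
A minimal sketch of the problem, with made-up values chosen so the two accesses alias; nothing below is WaveScalar syntax, it just shows why b depends on firing order:

    # Values chosen so that j + i*i == i*j == 8: the Load and Store
    # alias, and b depends on which memory operation fires first.
    i, j = 2, 4
    A = {}                             # memory backing array A

    def run(store_first):
        A.clear()
        A[i * j] = 99                  # stale value already in A[8]
        if store_first:
            A[j + i * i] = i           # Store: A[j + i*i] = i
            b = A[i * j]               # Load:  b = A[i*j]
        else:
            b = A[i * j]               # Load fires before the Store
            A[j + i * i] = i
        return b

    print(run(store_first=True))       # 2: the load sees the store
    print(run(store_first=False))      # 99: the load sees the stale value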

SLIDE 17

Wave-ordered Memory

  • Compiler annotates memory operations
  • Send memory requests in any order
  • Hardware reconstructs the correct order

Compiler annotations on each memory operation (ordered here by sequence #):

  Op      Predecessor   Sequence #   Successor
  Load         2             3           4
  Store        3             4           ?
  Store        4             5           6
  Load         5             6           8
  Store        4             7           8
  Load         ?             8           9

(? marks a link the compiler cannot know because of a branch or join.)

SLIDE 18

Wave-ordering Example

[Figure: a store buffer receives the annotated Load/Store requests out of order and uses the sequence #, successor, and predecessor numbers to replay them in program order.]

SLIDE 19

Wave-ordered Memory

Waves are loop-free sections of the dataflow graph.
Each dynamic wave has a wave number.
The wave number is incremented between waves.

Ordering memory (see the sketch below):

  • wave numbers
  • sequence number within a wave
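
Below is a simplified software sketch, not the real store-buffer hardware, of how the <predecessor, sequence #, successor> annotations and wave numbers could be used to replay requests in program order; the request format and the readiness rule are assumptions for illustration.

    # Simplified wave-ordered replay. Each request carries
    # (wave, pred, seq, succ); None stands for the '?' links left
    # unknown by branches and joins. A request may issue when it is
    # provably next: its predecessor link names the last issued op,
    # the last issued op's successor link names it, or it opens the
    # next wave.
    def reorder(requests):
        pending = sorted(requests, key=lambda r: (r["wave"], r["seq"]))
        issued, last = [], None
        while pending:
            for i, r in enumerate(pending):
                ready = (
                    last is None                         # very first op
                    or r["pred"] == last["seq"]          # chained by pred
                    or last["succ"] == r["seq"]          # chained by succ
                    or r["wave"] == last["wave"] + 1     # next wave begins
                )
                if ready:
                    last = pending.pop(i)
                    issued.append(last)
                    break
            else:
                break      # nothing provably next; wait for more arrivals
        return [r["op"] + " " + str(r["seq"]) for r in issued]

    # Requests from one taken path of the slide's example, arriving out
    # of order; the replay still recovers the order 3, 4, 7, 8.
    reqs = [
        {"wave": 0, "op": "Load",  "pred": None, "seq": 8, "succ": 9},
        {"wave": 0, "op": "Store", "pred": 3,    "seq": 4, "succ": None},
        {"wave": 0, "op": "Load",  "pred": 2,    "seq": 3, "succ": 4},
        {"wave": 0, "op": "Store", "pred": 4,    "seq": 7, "succ": 8},
    ]
    print(reorder(reqs))   # ['Load 3', 'Store 4', 'Store 7', 'Load 8']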

SLIDE 20

WaveScalar Tag-matching

WaveScalar tag

  • thread identifier
  • wave number

Token: tag & value, written <ThreadID:Wave#>.value

Example: an add consuming <2:5>.3 and <2:5>.6 produces <2:5>.9 (see the sketch below).
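
A minimal sketch of tag matching at a PE, assuming a two-input instruction; the class and method names are invented:

    # Tokens are (tag, value) pairs with tag = (thread_id, wave).
    # A two-input instruction fires only when both operands carrying
    # the same tag have arrived, so waves and threads never mix.
    class MatchingTable:
        def __init__(self, op):
            self.op = op
            self.waiting = {}            # tag -> value of the first operand

        def receive(self, tag, value):
            if tag in self.waiting:      # partner present: fire
                return (tag, self.op(self.waiting.pop(tag), value))
            self.waiting[tag] = value    # otherwise wait for the match
            return None

    add = MatchingTable(lambda a, b: a + b)
    add.receive((2, 5), 3)               # <2:5>.3 arrives and waits
    print(add.receive((2, 5), 6))        # <2:5>.6 arrives -> ((2, 5), 9)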

SLIDE 21

Single-thread Performance

[Chart: performance per unit area (AIPC/mm²) for WaveScalar (WS) vs. an out-of-order core (OOO) on ammp, art, equake, gzip, mcf, twolf, djpeg, mpeg2encode, rawdaudio, and the average; y-axis 0.01-0.05 AIPC/mm².]

SLIDE 22

Multithreading the WaveCache

Architectural support for WaveScalar threads (sketched after this list):

  • instructions to start & stop memory orderings, i.e., threads
  • memory-free synchronization to allow exclusive access to data (TC)
  • fence instruction to make this thread’s memory operations visible to other threads

Combine to build threads with multiple granularities:

  • coarse-grain threads: 25-168X over a single thread; 2-16X over CMP, 5-11X over SMT
  • fine-grain, dataflow-style threads: 18-242X over a single thread
  • combining the two in the same application: 1.6X, or 7.9X -> 9X
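
As a loose illustration of why thread management is cheap (the helper below is invented, not the WaveScalar ISA): creating a thread is essentially retagging a token with a fresh thread identifier and starting a new memory ordering for it, with no register or stack state to save.

    # Invented sketch: "spawning" a dataflow thread just rewrites the
    # thread-id field of a token's tag; a real implementation would also
    # start a new wave-ordered memory sequence for the child thread.
    next_thread_id = 3                   # hypothetical allocator state

    def spawn(token, first_wave=0):
        global next_thread_id
        (tid, wave), value = token
        child = ((next_thread_id, first_wave), value)
        next_thread_id += 1
        return child

    parent = ((1, 7), 42)                # <1:7>.42 in the parent thread
    print(spawn(parent))                 # ((3, 0), 42) runs independently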

SLIDE 23

Creating & Terminating a Thread

SLIDE 24

Thread Creation Overhead

SLIDE 25

Performance of Coarse-grain Parallelism

SLIDE 26

Performance of Fine-grain Parallelism

SLIDE 27

Building the WaveCache

RTL-level implementation

  • some didn’t believe it could be built in a normal-sized chip
  • some didn’t believe it could achieve a decent cycle time and load-use latencies
  • Verilog & Synopsys CAD tools

Different WaveCaches for different applications:

  • 1 cluster: low-cost, low-power, single-thread or embedded; 52 mm2 in a 90 nm process, 3.5 AIPC on Splash2
  • 16 clusters: multiple threads, higher performance; 436 mm2, 15 AIPC

Board-level FPGA implementation

  • OS & real application simulations