WaveScalar
CSE 548, Winter 2006

SLIDE 1
WaveScalar

Dataflow machine

  • good at exploiting ILP
  • dataflow parallelism + traditional coarser-grain parallelism
  • cheap thread management
  • memory ordering enforced through wave-ordered memory

SLIDE 2

WaveScalar

Motivation:

  • increasing disparity between computation (fast transistors) & communication (long wires)

  • increasing circuit complexity
  • decreasing fabrication reliability

SLIDE 3

Monolithic von Neumann Processors

A phenomenal success today. But in 2016?

  • Performance: centralized processing & control, e.g., operand broadcast networks
  • Complexity: 40-75% of “design” time is design verification
  • Defect tolerance: 1 flaw -> paperweight

SLIDE 4

WaveScalar Executive Summary

Distributed microarchitecture 

  • hundreds of PEs
  • dataflow execution – no centralized control
  • short point-to-point communication
  • organized hierarchically for fast communication between neighboring PEs

  • defect tolerance – route around a bad PE

Low design complexity through simple, identical PEs 

  • design one & stamp out thousands

SLIDE 5

Processing Element

  • distributed tag matching
  • 2 PEs in a pod

SLIDE 6

Domain

SLIDE 7

Cluster

SLIDE 8

Whole Chip

  • Can hold 32K instructions
  • Long distance communication
  • Dynamic routing
  • Grid-based network
  • 2-cycle hop/cluster (see the latency sketch below)
  • Normal memory hierarchy
  • Traditional directory-based cache coherence
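
As a back-of-the-envelope example (the 2-cycle hop figure is from the slide; the grid coordinates and dimension-ordered routing are assumptions for illustration), inter-cluster latency grows with Manhattan distance:

    # Rough inter-cluster latency on the grid network, assuming the
    # 2 cycles per cluster hop quoted above and dimension-ordered
    # routing; the 4x4 layout is only an example.
    def grid_latency(src, dst, cycles_per_hop=2):
        (x0, y0), (x1, y1) = src, dst
        return cycles_per_hop * (abs(x1 - x0) + abs(y1 - y0))

    print(grid_latency((0, 0), (3, 3)))   # corner to corner on a 4x4 grid: 12 cycles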

SLIDE 9

WaveScalar Execution Model

Dataflow: place instructions in PEs to maximize data locality & instruction-level parallelism.

  • instruction placement algorithm based on a performance model that captures the conflicting goals
  • depth-first traversal of the dataflow graph to make chains of dependent instructions (see the sketch below)
  • chains are broken into segments
  • segments snake across the chip on demand
  • K-loop bounding to prevent instruction “explosion”

Instructions communicate values directly (point-to-point).
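
The chain-building step might look like the following sketch; the graph encoding, traversal details, and segment size are illustrative assumptions, not the published placement algorithm.

    # Hypothetical sketch of the depth-first chain-building step: follow
    # one dependent as deep as possible so producer/consumer pairs land
    # next to each other, then cut chains into segments that map onto
    # neighboring PEs. All names and the segment size are made up.

    def build_chains(graph, roots):
        """graph: instruction -> list of dependent instructions."""
        visited, chains, work = set(), [], list(roots)
        while work:
            node = work.pop()
            if node in visited:
                continue
            chain = []
            while node is not None and node not in visited:
                visited.add(node)
                chain.append(node)
                succs = [s for s in graph.get(node, []) if s not in visited]
                node = succs[0] if succs else None   # stay on one dependence path
                work.extend(succs[1:])               # siblings seed new chains
            chains.append(chain)
        return chains

    def segments(chain, seg_size=8):
        """Cut a chain into placement segments, one per PE neighborhood."""
        return [chain[i:i + seg_size] for i in range(0, len(chain), seg_size)]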

SLIDE 10

WaveScalar Instruction Placement

SLIDE 11

WaveScalar Example

A[j + i*i] = i;
b = A[i*j];

[Dataflow graph: multiplies compute i*i and i*j, adds form j + i*i and the two addresses into A, a Store writes i, and a Load reads b.]

SLIDE 16

WaveScalar Example

A[j + i*i] = i;
b = A[i*j];

Global load-store ordering issue: pure dataflow gives no order between the Store and the Load, yet when j + i*i == i*j they touch the same element of A.
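
A minimal sketch of the problem, with made-up values chosen so the two accesses alias; nothing below is WaveScalar syntax, it just shows why b depends on firing order:

    # Values chosen so that j + i*i == i*j == 8: the Load and Store
    # alias, and b depends on which memory operation fires first.
    i, j = 2, 4
    A = {}                             # memory backing array A

    def run(store_first):
        A.clear()
        A[i * j] = 99                  # stale value already in A[8]
        if store_first:
            A[j + i * i] = i           # Store: A[j + i*i] = i
            b = A[i * j]               # Load:  b = A[i*j]
        else:
            b = A[i * j]               # Load fires before the Store
            A[j + i * i] = i
        return b

    print(run(store_first=True))       # 2: the load sees the store
    print(run(store_first=False))      # 99: the load sees the stale value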

SLIDE 17

Wave-ordered Memory

  • Compiler annotates memory operations
  • Send memory requests in any order
  • Hardware reconstructs the correct order

Compiler annotations on each memory operation (ordered here by sequence #):

  Op      Predecessor   Sequence #   Successor
  Load         2             3           4
  Store        3             4           ?
  Store        4             5           6
  Load         5             6           8
  Store        4             7           8
  Load         ?             8           9

(? marks a link the compiler cannot know because of a branch or join.)

SLIDE 18

Wave-ordering Example

[Figure: a store buffer receives the annotated Load/Store requests out of order and uses the sequence #, successor, and predecessor numbers to replay them in program order.]

SLIDE 19

Wave-ordered Memory

Waves are loop-free sections of the dataflow graph.
Each dynamic wave has a wave number.
The wave number is incremented between waves.

Ordering memory (see the sketch below):

  • wave numbers
  • sequence number within a wave
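
Below is a simplified software sketch, not the real store-buffer hardware, of how the <predecessor, sequence #, successor> annotations and wave numbers could be used to replay requests in program order; the request format and the readiness rule are assumptions for illustration.

    # Simplified wave-ordered replay. Each request carries
    # (wave, pred, seq, succ); None stands for the '?' links left
    # unknown by branches and joins. A request may issue when it is
    # provably next: its predecessor link names the last issued op,
    # the last issued op's successor link names it, or it opens the
    # next wave.
    def reorder(requests):
        pending = sorted(requests, key=lambda r: (r["wave"], r["seq"]))
        issued, last = [], None
        while pending:
            for i, r in enumerate(pending):
                ready = (
                    last is None                         # very first op
                    or r["pred"] == last["seq"]          # chained by pred
                    or last["succ"] == r["seq"]          # chained by succ
                    or r["wave"] == last["wave"] + 1     # next wave begins
                )
                if ready:
                    last = pending.pop(i)
                    issued.append(last)
                    break
            else:
                break      # nothing provably next; wait for more arrivals
        return [r["op"] + " " + str(r["seq"]) for r in issued]

    # Requests from one taken path of the slide's example, arriving out
    # of order; the replay still recovers the order 3, 4, 7, 8.
    reqs = [
        {"wave": 0, "op": "Load",  "pred": None, "seq": 8, "succ": 9},
        {"wave": 0, "op": "Store", "pred": 3,    "seq": 4, "succ": None},
        {"wave": 0, "op": "Load",  "pred": 2,    "seq": 3, "succ": 4},
        {"wave": 0, "op": "Store", "pred": 4,    "seq": 7, "succ": 8},
    ]
    print(reorder(reqs))   # ['Load 3', 'Store 4', 'Store 7', 'Load 8']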

SLIDE 20

WaveScalar Tag-matching

WaveScalar tag

  • thread identifier
  • wave number

Token: tag & value, written <ThreadID:Wave#>.value

Example: an add consuming <2:5>.3 and <2:5>.6 produces <2:5>.9 (see the sketch below).
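
A minimal sketch of tag matching at a PE, assuming a two-input instruction; the class and method names are invented:

    # Tokens are (tag, value) pairs with tag = (thread_id, wave).
    # A two-input instruction fires only when both operands carrying
    # the same tag have arrived, so waves and threads never mix.
    class MatchingTable:
        def __init__(self, op):
            self.op = op
            self.waiting = {}            # tag -> value of the first operand

        def receive(self, tag, value):
            if tag in self.waiting:      # partner present: fire
                return (tag, self.op(self.waiting.pop(tag), value))
            self.waiting[tag] = value    # otherwise wait for the match
            return None

    add = MatchingTable(lambda a, b: a + b)
    add.receive((2, 5), 3)               # <2:5>.3 arrives and waits
    print(add.receive((2, 5), 6))        # <2:5>.6 arrives -> ((2, 5), 9)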

SLIDE 21

Single-thread Performance

[Chart: performance per unit area (AIPC/mm²) for WaveScalar (WS) vs. an out-of-order core (OOO) on ammp, art, equake, gzip, mcf, twolf, djpeg, mpeg2encode, rawdaudio, and the average; y-axis 0.01-0.05 AIPC/mm².]

SLIDE 22

Multithreading the WaveCache

Architectural support for WaveScalar threads (sketched after this list):

  • instructions to start & stop memory orderings, i.e., threads
  • memory-free synchronization to allow exclusive access to data (TC)
  • fence instruction to make this thread’s memory operations visible to other threads

Combine to build threads with multiple granularities:

  • coarse-grain threads: 25-168X over a single thread; 2-16X over CMP, 5-11X over SMT
  • fine-grain, dataflow-style threads: 18-242X over a single thread
  • combining the two in the same application: 1.6X, or 7.9X -> 9X
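
As a loose illustration of why thread management is cheap (the helper below is invented, not the WaveScalar ISA): creating a thread is essentially retagging a token with a fresh thread identifier and starting a new memory ordering for it, with no register or stack state to save.

    # Invented sketch: "spawning" a dataflow thread just rewrites the
    # thread-id field of a token's tag; a real implementation would also
    # start a new wave-ordered memory sequence for the child thread.
    next_thread_id = 3                   # hypothetical allocator state

    def spawn(token, first_wave=0):
        global next_thread_id
        (tid, wave), value = token
        child = ((next_thread_id, first_wave), value)
        next_thread_id += 1
        return child

    parent = ((1, 7), 42)                # <1:7>.42 in the parent thread
    print(spawn(parent))                 # ((3, 0), 42) runs independently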

SLIDE 23

Creating & Terminating a Thread

SLIDE 24

Thread Creation Overhead

SLIDE 25

Performance of Coarse-grain Parallelism

SLIDE 26

Performance of Fine-grain Parallelism

SLIDE 27

Building the WaveCache

RTL-level implementation

  • some didn’t believe it could be built in a normal-sized chip
  • some didn’t believe it could achieve a decent cycle time and load-use latencies
  • Verilog & Synopsys CAD tools

Different WaveCaches for different applications:

  • 1 cluster: low-cost, low-power, single-thread or embedded; 52 mm2 in a 90 nm process, 3.5 AIPC on Splash2
  • 16 clusters: multiple threads, higher performance; 436 mm2, 15 AIPC

Board-level FPGA implementation

  • OS & real application simulations