

SLIDE 1

Acceleration in the Wild, with Data Flow Computing

James Spooner, VP of Acceleration
QCon, Finance Track, 08 March 2012

SLIDE 2

Acceleration in the Wild with Data Flow

  • Deliberate, focused approach to improving application speed
    – Involves adding Data Flow Engines (DFEs)
    – Makes some of the program faster
    – Will be programmed intentionally and be architecture specific
    – Will exploit as much available parallelism as possible
    – May require transformations to expose parallelism
    – May have multiple implementations

Maxeler is an acceleration specialist, delivering end-to-end performance for a range of clients in the banking and oil/gas exploration industries.

SLIDE 3

Making efficient use of Silicon

SLIDE 4

Computing History…

  • J. P. Eckert, Jr. (co-inventor of ENIAC)

Credit: Prof. Paul H.J. Kelly

SLIDE 5

Computing History…

“The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.”

  • Daniel Slotnick (Chief Architect of ILLIAC IV), 1967

Credit: Prof. Michael J. Flynn

SLIDE 6

So what happened?

  • Eckert (and Amdahl) were right, Slotnick was wrong, until…
  • Serial computing hit the wall(s) last decade:
    – The memory wall: the increasing gap between processor and memory speeds. This effect pushes cache sizes larger in order to mask the latency of memory, which helps only to the extent that memory bandwidth is not the bottleneck in performance.
    – The ILP wall: the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.
    – The power wall: the trend of consuming exponentially increasing power with each factorial increase of operating frequency. This increase can be mitigated by "shrinking" the processor, using smaller traces for the same logic. The power wall poses manufacturing, system design and deployment problems that have not been justified in the face of the diminished gains in performance due to the memory wall and ILP wall.

Source: Wikipedia

Dynamic power: P = C · V_DD² · f · load_avg
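To see why the power wall bites, here is the relation behind the scattered symbols above, with a worked example whose numbers are illustrative rather than from the talk:

    P = C \, V_{DD}^{2} \, f \cdot \mathrm{load}_{avg}

    \frac{P'}{P} = \left(\frac{1.2\,\mathrm{V}}{1.0\,\mathrm{V}}\right)^{2} \times \frac{4\,\mathrm{GHz}}{2\,\mathrm{GHz}} = 1.44 \times 2 \approx 2.9

Doubling the clock while nudging the supply voltage up by 20% nearly triples power, which is why frequency scaling stopped paying for itself.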

SLIDE 7

Using silicon efficiently - parallelism

  • Coarse grained
    – Examples: multi-node, multi-chip, multi-core; process / thread level parallelism
    – Costs: developing a distributed system; locks, mutexes, queues, etc.
  • Fine grained
    – Examples: instruction level parallelism (ILP): out-of-order execution, superscalar, instruction pipelining, speculative execution; data level parallelism: SIMD / SSE
    – Costs: lots of silicon; compiler can do some work upfront
  • Ultra fine grained
    – Examples: Data Flow architectures: massively parallel, lock free, hazard free, streaming datapaths
    – Costs: resolve once
SLIDE 8

How is modern silicon used?

[Die photo: Intel 6-Core X5680 “Westmere”]

SLIDE 9

How is modern silicon used?

[Die photo: Intel 6-Core X5680 “Westmere”, annotated to separate Computation from Support logic for fine-grained parallelism]

SLIDE 10

What is Dataflow Computing?

Computing with control flow processors vs. computing with dataflow engines (DFEs)
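To make the contrast concrete before the Maxeler-specific material, a minimal Java sketch (illustrative only, not the Maxeler programming model): the control-flow version prescribes an order of operations, while the dataflow-style version declares what each output is and leaves the schedule to the engine.

    import java.util.stream.IntStream;

    public class ControlVsDataflow {
        public static void main(String[] args) {
            int[] x = {0, 1, 2, 3, 4, 5};

            // Control flow: one instruction stream visits each element in turn.
            int[] y1 = new int[x.length];
            for (int i = 0; i < x.length; i++) {
                y1[i] = x[i] * x[i] + 30;
            }

            // Dataflow style: describe the computation over a stream of values;
            // the order of element evaluation is left to the scheduler.
            int[] y2 = IntStream.of(x).map(v -> v * v + 30).toArray();

            System.out.println(java.util.Arrays.equals(y1, y2)); // true
        }
    }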

SLIDE 11

MPC-X1000

  • 8 vectis dataflow engines (DFEs)
  • 192GB of DFE RAM
  • Dynamic allocation of DFEs to conventional CPU servers
    – Zero-copy RDMA between CPUs and DFEs over Infiniband
  • Equivalent performance to 40-60 x86 servers

1U dataflow cloud providing dynamically scalable compute capability over Infiniband

SLIDE 12

Dataflow Programming

SLIDE 13

Application Components

[Diagram: the host application and the SLiC interface run on the CPU with its memory; MaxelerOS manages the PCI Express link to the DFE, where a Manager routes data between the interconnect, DataFlow memory, and the Kernels, which are the arithmetic datapaths (*, +) themselves.]

SLIDE 14

Programming with MaxCompiler

[Diagram: the host application is written in C / C++ / Fortran and reaches the DFE through the SLiC interface; kernels and managers are written in MaxJ.]

SLIDE 15

MaxCompiler Development Process

Starting point: everything on the CPU.

CPU Code (.c):

    int *x, *y;
    for (int i = 0; i < DATA_SIZE; i++)
        y[i] = x[i] * x[i] + 30;

[Diagram: the CPU executes the loop against main memory; the loop body corresponds to a small dataflow graph reading x, computing x*x + 30 and writing y.]

SLIDE 16

MaxCompiler Development Process

The original loop

    for (int i = 0; i < DATA_SIZE; i++)
        y[i] = x[i] * x[i] + 30;

is split into a kernel (the dataflow description), a manager (the chip-level plumbing), and a SLiC call from the host.

MyKernel (.java):

    HWVar x = io.input("x", hwInt(32));
    HWVar result = x * x + 30;
    io.output("y", result, hwInt(32));

Manager (.java):

    Manager m = new Manager("Calc");
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", PCIE),
            link("y", PCIE));
    m.addMode(modeDefault());
    m.build();

CPU Code (.c):

    #include "MaxSLiCInterface.h"
    #include "Calc.max"

    int *x, *y;
    Calc(x, y, DATA_SIZE);

[Diagram: main memory on the host, SLiC and MaxelerOS moving the x and y streams over PCI Express, and the x*x + 30 datapath on the chip.]

SLIDE 17

MaxCompiler Development Process

The same kernel, but the manager now routes the output stream y to on-card DRAM instead of back over PCI Express.

MyKernel (.java):

    HWVar x = io.input("x", hwInt(32));
    HWVar result = x * x + 30;
    io.output("y", result, hwInt(32));

Manager (.java):

    Manager m = new Manager();
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", PCIE),
            link("y", DRAM_LINEAR1D));
    m.addMode(modeDefault());
    m.build();

CPU Code (.c):

    #include "MaxSLiCInterface.h"
    #include "Calc.max"

    int *x, *y;
    device = max_open_device(maxfile, "/dev/maxeler0");
    Calc(x, DATA_SIZE);

[Diagram: x streams from main memory over PCI Express into the x*x + 30 datapath; y streams into the DFE's local memory.]

SLIDE 18

The Full Kernel

    public class MyKernel extends Kernel {
        public MyKernel(KernelParameters parameters) {
            super(parameters);
            HWVar x = io.input("x", hwInt(32));
            HWVar result = x * x + 30;
            io.output("y", result, hwInt(32));
        }
    }

[Dataflow graph: x feeds (x * x), then (+ 30), then output y]

SLIDE 19-28

Kernel Streaming: In Hardware

[Animation across ten slides: the input stream x = 0, 1, 2, 3, 4, 5 enters the x*x + 30 datapath one value per clock cycle. Once the pipeline has filled, one result leaves per cycle, and the output stream y = 30, 31, 34, 39, 46, 55 emerges.]
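The ten animation frames collapse into a short software model; a plain-Java sketch (an illustrative simulation, not how a DFE is built) of a two-stage pipeline for x * x + 30, consuming one input and, once full, producing one result per clock tick:

    public class KernelStreamSim {
        public static void main(String[] args) {
            int[] x = {0, 1, 2, 3, 4, 5};
            // Pipeline registers: stage 1 holds x*x, stage 2 holds x*x + 30.
            Integer stage1 = null, stage2 = null;
            for (int tick = 0; tick < x.length + 2; tick++) {
                // On each tick, every stage advances simultaneously.
                Integer out = stage2;
                stage2 = (stage1 == null) ? null : stage1 + 30;
                stage1 = (tick < x.length) ? x[tick] * x[tick] : null;
                if (out != null)
                    System.out.println("tick " + tick + ": y = " + out);
            }
            // Prints 30, 31, 34, 39, 46, 55 -- one result per tick after fill.
        }
    }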

SLIDE 29

Data flow graph as generated by MaxCompiler: 4866 nodes, about 250x100.

SLIDE 30

How we approach Acceleration

SLIDE 31

What always makes Acceleration hard?

  • Messy code
  • Complicated build dependences
  • Confused control-flow
  • Impenetrable data access
  • Pointer-intensive data structures
  • Premature optimization

[Diagram: an array of pointers to heterogeneous point records, most holding (x, y, z), some carrying extra fields such as p, q, r, θ, scattered across the heap.]

    for (i = 0; i < N; ++i) {
        points[i]->incx();
    }

SLIDE 32

Conflicting Goals

  • Some well-motivated software structures have real value, but make acceleration harder
  • Examples:
    – Virtual method calls inside a loop
    – Collections with non-uniform type
    – Substructure sharing

(Same pointer-heavy diagram and loop as Slide 31.)

SLIDE 33

What makes Acceleration easier?

  • Self-evident data dependences
  • Computing on large collections of uniform data
  • Appropriate representation hiding
  • Getting the abstraction right

[Diagram: the same points stored as a structure of arrays: x x x x x x x x / y y y y y y y y / z z z z z z z z, uniform and contiguous.]
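The contrast between the last three slides can be written down directly; a small Java sketch (type and field names are mine, not from the talk) of the two layouts:

    public class PointLayouts {
        // Slides 31-32: array of heterogeneous objects reached through pointers.
        // Each incx() is a pointer chase plus a dynamic dispatch.
        static class Point { double x, y, z; void incx() { x += 1.0; } }

        // Slide 33: structure of arrays. Uniform, contiguous, unit-stride data
        // with self-evident dependences: each iteration touches only x[i].
        static class Points {
            final double[] x, y, z;
            Points(int n) { x = new double[n]; y = new double[n]; z = new double[n]; }
            void incx() { for (int i = 0; i < x.length; i++) x[i] += 1.0; }
        }
    }

The second form is the one a streaming datapath can consume: the x values arrive as one regular stream, with no pointers to resolve.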

SLIDE 34

Maximum Performance Computing

  • Identify parallelism and take advantage of it
    – Fully understand data dependencies
  • Minimize memory bandwidth
    – Data reuse and representation
  • Regularize the computation and data
    – Minimize control flow complexity
  • Find optimal balance for the underlying architecture
    – Memory hierarchy bandwidth(s), size(s) and latency(s)
    – Communication bandwidth(s) and latency(s)
    – Math performance
    – Branch cost (control divergence)
    – Axes of parallelism
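One concrete instance of regularizing computation to minimize control flow (my example, not the talk's): replace a data-dependent branch by computing both candidates and selecting, which pipelines cleanly in a hazard-free datapath.

    public class Branchless {
        // Branchy: the control path depends on the data.
        static double clampBranchy(double v, double hi) {
            if (v > hi) return hi; else return v;
        }

        // Regularized: both operands are always available and one is selected;
        // no divergence, so the operation streams through a fixed datapath.
        static double clampSelect(double v, double hi) {
            return Math.min(v, hi);
        }
    }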

SLIDE 35

Maxeler Acceleration Process

Analysis, Code Transformation, Partitioning, Implementation, Result: analysis sets the theoretical performance bounds; implementation achieves the performance.

  • Run the code with profiling tools
  • Understand data and loop structures and data access patterns
  • Investigate transformation options for these structures and access patterns
  • Decide which parts of the code need acceleration
  • Implement and validate

SLIDE 36

Application Analysis

SLIDE 37

Partitioning Options

Inputs: data access plans, code partitioning, transformations.

[Plot: candidate implementations on axes of runtime vs. development time, with the Pareto optimal options on the frontier.]

Try to minimise runtime and development time, while maximising flexibility and precision.
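As a sketch of what "Pareto optimal" means for partitioning choices (illustrative Java with made-up option names, not a Maxeler tool): an option is kept only if no other option is at least as good on both axes and strictly better on one.

    import java.util.*;

    public class ParetoOptions {
        static class Option {
            final String name; final double runtime, devTime;
            Option(String name, double runtime, double devTime) {
                this.name = name; this.runtime = runtime; this.devTime = devTime;
            }
        }

        // Keep options not dominated on (runtime, development time).
        static List<Option> paretoFront(List<Option> opts) {
            List<Option> front = new ArrayList<>();
            for (Option a : opts) {
                boolean dominated = false;
                for (Option b : opts)
                    if (b != a && b.runtime <= a.runtime && b.devTime <= a.devTime
                            && (b.runtime < a.runtime || b.devTime < a.devTime)) {
                        dominated = true; break;
                    }
                if (!dominated) front.add(a);
            }
            return front;
        }

        public static void main(String[] args) {
            List<Option> opts = Arrays.asList(
                new Option("CPU only",        100.0,  1.0),
                new Option("Hot loop on DFE",  10.0,  4.0),
                new Option("Full app on DFE",   2.0, 12.0),
                new Option("Naive port",       50.0,  8.0)); // dominated by "Hot loop"
            for (Option o : paretoFront(opts)) System.out.println(o.name);
        }
    }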

SLIDE 38

Credit Derivatives Valuation & Risk

  • Compute value of complex financial derivatives (CDOs)
  • Typically run overnight, but beneficial to compute in real time
  • Many independent jobs
  • Speedup: 220-270x
  • Power consumption drops from 250W to 235W per node

SLIDE 39

Discovering the Dataflow of an Application

SLIDE 40

MaxSpot

  • Developed in-house to make deciphering complex code easier
  • MaxSpot is a tool to profile, analyse, and visualise the dynamic behaviour of applications
  • Extensible analysis framework
  • Determines control-flow and data-flow
  • Builds loop graphs
  • Runs on application binaries
    – Independent of original programming language(s)
    – Execute MaxSpot with one (or more) test data sets and observe code paths

SLIDE 41

Control Flow: Matrix Multiply

    void mm5(A, B, C)
    FLOATTYPE A[SZ][SZ], B[SZ][SZ], C[SZ][SZ];
    {
        int i, j, k;
        FLOATTYPE r;
        for (k = 0; k < SZ; k++) {
            for (i = 0; i < SZ; i++) {
                for (j = 0; j < SZ; j++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }

SLIDE 42

Data Flow: Matrix Multiply

[Figure: the corresponding dataflow graph for the mm5 loop nest.]
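The graph itself is not reproducible in text, but the point it makes can be annotated on the code; a plain-Java restatement of mm5 (illustrative, not MaxCompiler output) with the dependence structure the dataflow view exposes spelled out in comments:

    public class MatMulDataflow {
        static final int SZ = 4;

        // Control-flow view: the counters k, i, j impose an order.
        // Dataflow view: each C[i][j] is the sum of SZ products A[i][k] * B[k][j];
        // within one k-slice all SZ*SZ multiply-accumulates are independent,
        // so they can run as parallel streaming datapaths.
        static double[][] mm(double[][] a, double[][] b) {
            double[][] c = new double[SZ][SZ];
            for (int k = 0; k < SZ; k++)
                for (int i = 0; i < SZ; i++)
                    for (int j = 0; j < SZ; j++)
                        c[i][j] += a[i][k] * b[k][j];
            return c;
        }

        public static void main(String[] args) {
            double[][] id = new double[SZ][SZ];
            for (int i = 0; i < SZ; i++) id[i][i] = 1.0;
            System.out.println(mm(id, id)[0][0]); // 1.0: identity times identity
        }
    }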

SLIDE 43

Performance and Profiling of Accelerated Systems

SLIDE 44

Measuring Utilization

  • top measures the % of time the CPU is running
  • maxtop monitors the % of time the DFE is running

    MaxTop Tool 2011.2
    Found 2 Maxeler card(s) running MaxelerOS 2011.2
    Card 0: MAX3A (P/N: 13424) S/N: 219270088 Mem: 24GB DFE(s): 1 /dev/maxeler0
    Card 1: MAX3A (P/N: 13424) S/N: 000025559 Mem: 24GB DFE(s): 1 /dev/maxeler1
    DEVICE    %DFE   TEMP   BITSTREAM  PID    USER      TIME      COMMAND
    maxeler0  66.6%  57.1C  9d9de1...  12333  jspooner  00:00:39  model
    maxeler1   0.0%  54.6C  9d9de1...  -      -         -         -

SLIDE 45

Overlapping CPU + DFE

  • CPU and DFE can (and should!) process in parallel
    – Runtime is always limited by the longest-running part

[Diagram: sequential run-time stacks the CPU and DFE phases end to end; overlapped run-time runs them concurrently, shrinking the total.]
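A minimal sketch of the overlap, assuming a host-side queue to a single accelerator (plain Java; a real application would use the accelerator runtime's asynchronous interface rather than a thread pool): while batch b runs on the "DFE", the CPU prepares batch b + 1, so total time approaches max(CPU time, DFE time) instead of their sum.

    import java.util.concurrent.*;

    public class OverlapSketch {
        // Illustrative stand-ins for the real CPU and DFE work.
        static void prepareOnCpu(int b) { /* CPU pre/post-processing for batch b */ }
        static void runOnDfe(int b)     { /* DFE compute for batch b */ }

        public static void main(String[] args) throws Exception {
            final int BATCHES = 8;
            ExecutorService dfe = Executors.newSingleThreadExecutor(); // models the DFE
            prepareOnCpu(0);
            Future<?> inFlight = null;
            for (int b = 0; b < BATCHES; b++) {
                if (inFlight != null) inFlight.get();         // wait for previous batch
                final int batch = b;
                inFlight = dfe.submit(() -> runOnDfe(batch)); // launch batch b
                if (b + 1 < BATCHES) prepareOnCpu(b + 1);     // CPU works in parallel
            }
            inFlight.get();
            dfe.shutdown();
        }
    }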

SLIDE 46

Performance Profiling

SLIDE 47

Maxeler University Program Members

SLIDE 48

Conclusions

  • The challenge is to make the best use of the silicon we can
  • Frequency scaling is over; it's time to start thinking in parallel
  • Heterogeneous system design allows us to tailor systems to the applications
  • Ultra-fine-grained parallelism in dataflow computing benefits throughput and latency