

SLIDE 1

Acceleration in the Wild, with Data Flow Computing

James Spooner, VP of Acceleration
QCon, Finance Track, 08 March 2012

SLIDE 2

Acceleration in the Wild with Data Flow

  • Deliberate, focused approach to improving application speed
    – Involves adding Data Flow Engines (DFEs)
    – Makes some of the program faster
    – Will be programmed intentionally and be architecture specific
    – Will exploit as much available parallelism as possible
    – May require transformations to expose parallelism
    – May have multiple implementations

Maxeler is an acceleration specialist, delivering end-to-end performance for a range of clients in the banking and oil/gas exploration industries.

SLIDE 3

Making efficient use of Silicon

SLIDE 4

Computing History…

  • J. P. Eckert, Jr. (co-inventor of ENIAC)

Credit: Prof. Paul H.J. Kelly

SLIDE 5

Computing History…

“The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.”

  • Daniel Slotnick (Chief Architect of ILLIAC IV), 1967

Credit: Prof. Michael J. Flynn

SLIDE 6

So what happened?

  • Eckert (and Amdahl) were right, Slotnick was wrong, until…
  • Serial computing hit the wall(s) last decade:
    – The memory wall: the increasing gap between processor and memory speeds. This effect pushes cache sizes larger in order to mask the latency of memory, which helps only to the extent that memory bandwidth is not the bottleneck in performance.
    – The ILP wall: the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.
    – The power wall: the trend of consuming exponentially increasing power with each factorial increase of operating frequency. This increase can be mitigated by "shrinking" the processor, using smaller traces for the same logic. The power wall poses manufacturing, system design and deployment problems that have not been justified in the face of the diminished gains in performance due to the memory wall and ILP wall.

Source: Wikipedia

Dynamic power: P = C · V_DD² · f · load_avg
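To see why the power wall bites, here is the relation behind the scattered symbols above, with a worked example whose numbers are illustrative rather than from the talk:

    P = C \, V_{DD}^{2} \, f \cdot \mathrm{load}_{avg}

    \frac{P'}{P} = \left(\frac{1.2\,\mathrm{V}}{1.0\,\mathrm{V}}\right)^{2} \times \frac{4\,\mathrm{GHz}}{2\,\mathrm{GHz}} = 1.44 \times 2 \approx 2.9

Doubling the clock while nudging the supply voltage up by 20% nearly triples power, which is why frequency scaling stopped paying for itself.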

SLIDE 7

Using silicon efficiently - parallelism

  • Coarse grained
    – Examples: multi-node, multi-chip, multi-core; process / thread level parallelism
    – Costs: developing a distributed system; locks, mutexes, queues, etc.
  • Fine grained
    – Examples: instruction level parallelism (ILP): out-of-order execution, superscalar, instruction pipelining, speculative execution; data level parallelism: SIMD / SSE
    – Costs: lots of silicon; compiler can do some work upfront
  • Ultra fine grained
    – Examples: Data Flow architectures: massively parallel, lock free, hazard free, streaming datapaths
    – Costs: resolve once
SLIDE 8

How is modern silicon used?

[Die photo: Intel 6-Core X5680 “Westmere”]

SLIDE 9

How is modern silicon used?

[Die photo: Intel 6-Core X5680 “Westmere”, annotated to separate Computation from Support logic for fine-grained parallelism]

SLIDE 10

What is Dataflow Computing?

Computing with control flow processors vs. computing with dataflow engines (DFEs)
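To make the contrast concrete before the Maxeler-specific material, a minimal Java sketch (illustrative only, not the Maxeler programming model): the control-flow version prescribes an order of operations, while the dataflow-style version declares what each output is and leaves the schedule to the engine.

    import java.util.stream.IntStream;

    public class ControlVsDataflow {
        public static void main(String[] args) {
            int[] x = {0, 1, 2, 3, 4, 5};

            // Control flow: one instruction stream visits each element in turn.
            int[] y1 = new int[x.length];
            for (int i = 0; i < x.length; i++) {
                y1[i] = x[i] * x[i] + 30;
            }

            // Dataflow style: describe the computation over a stream of values;
            // the order of element evaluation is left to the scheduler.
            int[] y2 = IntStream.of(x).map(v -> v * v + 30).toArray();

            System.out.println(java.util.Arrays.equals(y1, y2)); // true
        }
    }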

SLIDE 11

MPC-X1000

  • 8 vectis dataflow engines (DFEs)
  • 192GB of DFE RAM
  • Dynamic allocation of DFEs to conventional CPU servers
    – Zero-copy RDMA between CPUs and DFEs over Infiniband
  • Equivalent performance to 40-60 x86 servers

1U dataflow cloud providing dynamically scalable compute capability over Infiniband

SLIDE 12

Dataflow Programming

SLIDE 13

Application Components

[Diagram: the host application and the SLiC interface run on the CPU with its memory; MaxelerOS manages the PCI Express link to the DFE, where a Manager routes data between the interconnect, DataFlow memory, and the Kernels, which are the arithmetic datapaths (*, +) themselves.]

SLIDE 14

Programming with MaxCompiler

[Diagram: the host application is written in C / C++ / Fortran and reaches the DFE through the SLiC interface; kernels and managers are written in MaxJ.]

SLIDE 15

MaxCompiler Development Process

Starting point: everything on the CPU.

CPU Code (.c):

    int *x, *y;
    for (int i = 0; i < DATA_SIZE; i++)
        y[i] = x[i] * x[i] + 30;

[Diagram: the CPU executes the loop against main memory; the loop body corresponds to a small dataflow graph reading x, computing x*x + 30 and writing y.]

SLIDE 16

MaxCompiler Development Process

The original loop

    for (int i = 0; i < DATA_SIZE; i++)
        y[i] = x[i] * x[i] + 30;

is split into a kernel (the dataflow description), a manager (the chip-level plumbing), and a SLiC call from the host.

MyKernel (.java):

    HWVar x = io.input("x", hwInt(32));
    HWVar result = x * x + 30;
    io.output("y", result, hwInt(32));

Manager (.java):

    Manager m = new Manager("Calc");
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", PCIE),
            link("y", PCIE));
    m.addMode(modeDefault());
    m.build();

CPU Code (.c):

    #include "MaxSLiCInterface.h"
    #include "Calc.max"

    int *x, *y;
    Calc(x, y, DATA_SIZE);

[Diagram: main memory on the host, SLiC and MaxelerOS moving the x and y streams over PCI Express, and the x*x + 30 datapath on the chip.]

SLIDE 17

MaxCompiler Development Process

The same kernel, but the manager now routes the output stream y to on-card DRAM instead of back over PCI Express.

MyKernel (.java):

    HWVar x = io.input("x", hwInt(32));
    HWVar result = x * x + 30;
    io.output("y", result, hwInt(32));

Manager (.java):

    Manager m = new Manager();
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", PCIE),
            link("y", DRAM_LINEAR1D));
    m.addMode(modeDefault());
    m.build();

CPU Code (.c):

    #include "MaxSLiCInterface.h"
    #include "Calc.max"

    int *x, *y;
    device = max_open_device(maxfile, "/dev/maxeler0");
    Calc(x, DATA_SIZE);

[Diagram: x streams from main memory over PCI Express into the x*x + 30 datapath; y streams into the DFE's local memory.]

SLIDE 18

The Full Kernel

    public class MyKernel extends Kernel {
        public MyKernel(KernelParameters parameters) {
            super(parameters);
            HWVar x = io.input("x", hwInt(32));
            HWVar result = x * x + 30;
            io.output("y", result, hwInt(32));
        }
    }

[Dataflow graph: x feeds (x * x), then (+ 30), then output y]

SLIDE 19-28

Kernel Streaming: In Hardware

[Animation across ten slides: the input stream x = 0, 1, 2, 3, 4, 5 enters the x*x + 30 datapath one value per clock cycle. Once the pipeline has filled, one result leaves per cycle, and the output stream y = 30, 31, 34, 39, 46, 55 emerges.]
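The ten animation frames collapse into a short software model; a plain-Java sketch (an illustrative simulation, not how a DFE is built) of a two-stage pipeline for x * x + 30, consuming one input and, once full, producing one result per clock tick:

    public class KernelStreamSim {
        public static void main(String[] args) {
            int[] x = {0, 1, 2, 3, 4, 5};
            // Pipeline registers: stage 1 holds x*x, stage 2 holds x*x + 30.
            Integer stage1 = null, stage2 = null;
            for (int tick = 0; tick < x.length + 2; tick++) {
                // On each tick, every stage advances simultaneously.
                Integer out = stage2;
                stage2 = (stage1 == null) ? null : stage1 + 30;
                stage1 = (tick < x.length) ? x[tick] * x[tick] : null;
                if (out != null)
                    System.out.println("tick " + tick + ": y = " + out);
            }
            // Prints 30, 31, 34, 39, 46, 55 -- one result per tick after fill.
        }
    }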

SLIDE 29

Data flow graph as generated by MaxCompiler: 4866 nodes, about 250x100.

SLIDE 30

How we approach Acceleration

SLIDE 31

What always makes Acceleration hard?

  • Messy code
  • Complicated build dependences
  • Confused control-flow
  • Impenetrable data access
  • Pointer-intensive data structures
  • Premature optimization

[Diagram: an array of pointers to heterogeneous point records, most holding (x, y, z), some carrying extra fields such as p, q, r, θ, scattered across the heap.]

    for (i = 0; i < N; ++i) {
        points[i]->incx();
    }

SLIDE 32

Conflicting Goals

  • Some well-motivated software structures have real value, but make acceleration harder
  • Examples:
    – Virtual method calls inside a loop
    – Collections with non-uniform type
    – Substructure sharing

(Same pointer-heavy diagram and loop as Slide 31.)

SLIDE 33

What makes Acceleration easier?

  • Self-evident data dependences
  • Computing on large collections of uniform data
  • Appropriate representation hiding
  • Getting the abstraction right

[Diagram: the same points stored as a structure of arrays: x x x x x x x x / y y y y y y y y / z z z z z z z z, uniform and contiguous.]
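The contrast between the last three slides can be written down directly; a small Java sketch (type and field names are mine, not from the talk) of the two layouts:

    public class PointLayouts {
        // Slides 31-32: array of heterogeneous objects reached through pointers.
        // Each incx() is a pointer chase plus a dynamic dispatch.
        static class Point { double x, y, z; void incx() { x += 1.0; } }

        // Slide 33: structure of arrays. Uniform, contiguous, unit-stride data
        // with self-evident dependences: each iteration touches only x[i].
        static class Points {
            final double[] x, y, z;
            Points(int n) { x = new double[n]; y = new double[n]; z = new double[n]; }
            void incx() { for (int i = 0; i < x.length; i++) x[i] += 1.0; }
        }
    }

The second form is the one a streaming datapath can consume: the x values arrive as one regular stream, with no pointers to resolve.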

SLIDE 34

Maximum Performance Computing

  • Identify parallelism and take advantage of it
    – Fully understand data dependencies
  • Minimize memory bandwidth
    – Data reuse and representation
  • Regularize the computation and data
    – Minimize control flow complexity
  • Find optimal balance for the underlying architecture
    – Memory hierarchy bandwidth(s), size(s) and latency(s)
    – Communication bandwidth(s) and latency(s)
    – Math performance
    – Branch cost (control divergence)
    – Axes of parallelism
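One concrete instance of regularizing computation to minimize control flow (my example, not the talk's): replace a data-dependent branch by computing both candidates and selecting, which pipelines cleanly in a hazard-free datapath.

    public class Branchless {
        // Branchy: the control path depends on the data.
        static double clampBranchy(double v, double hi) {
            if (v > hi) return hi; else return v;
        }

        // Regularized: both operands are always available and one is selected;
        // no divergence, so the operation streams through a fixed datapath.
        static double clampSelect(double v, double hi) {
            return Math.min(v, hi);
        }
    }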

SLIDE 35

Maxeler Acceleration Process

Analysis, Code Transformation, Partitioning, Implementation, Result: analysis sets the theoretical performance bounds; implementation achieves the performance.

  • Run the code with profiling tools
  • Understand data and loop structures and data access patterns
  • Investigate transformation options for these structures and access patterns
  • Decide which parts of the code need acceleration
  • Implement and validate

SLIDE 36

Application Analysis

SLIDE 37

Partitioning Options

Inputs: data access plans, code partitioning, transformations.

[Plot: candidate implementations on axes of runtime vs. development time, with the Pareto optimal options on the frontier.]

Try to minimise runtime and development time, while maximising flexibility and precision.
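As a sketch of what "Pareto optimal" means for partitioning choices (illustrative Java with made-up option names, not a Maxeler tool): an option is kept only if no other option is at least as good on both axes and strictly better on one.

    import java.util.*;

    public class ParetoOptions {
        static class Option {
            final String name; final double runtime, devTime;
            Option(String name, double runtime, double devTime) {
                this.name = name; this.runtime = runtime; this.devTime = devTime;
            }
        }

        // Keep options not dominated on (runtime, development time).
        static List<Option> paretoFront(List<Option> opts) {
            List<Option> front = new ArrayList<>();
            for (Option a : opts) {
                boolean dominated = false;
                for (Option b : opts)
                    if (b != a && b.runtime <= a.runtime && b.devTime <= a.devTime
                            && (b.runtime < a.runtime || b.devTime < a.devTime)) {
                        dominated = true; break;
                    }
                if (!dominated) front.add(a);
            }
            return front;
        }

        public static void main(String[] args) {
            List<Option> opts = Arrays.asList(
                new Option("CPU only",        100.0,  1.0),
                new Option("Hot loop on DFE",  10.0,  4.0),
                new Option("Full app on DFE",   2.0, 12.0),
                new Option("Naive port",       50.0,  8.0)); // dominated by "Hot loop"
            for (Option o : paretoFront(opts)) System.out.println(o.name);
        }
    }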

SLIDE 38

Credit Derivatives Valuation & Risk

  • Compute value of complex financial derivatives (CDOs)
  • Typically run overnight, but beneficial to compute in real time
  • Many independent jobs
  • Speedup: 220-270x
  • Power consumption drops from 250W to 235W per node

SLIDE 39

Discovering the Dataflow of an Application

SLIDE 40

MaxSpot

  • Developed in-house to make deciphering complex code easier
  • MaxSpot is a tool to profile, analyse, and visualise the dynamic behaviour of applications
  • Extensible analysis framework
  • Determines control-flow and data-flow
  • Builds loop graphs
  • Runs on application binaries
    – Independent of original programming language(s)
    – Execute MaxSpot with one (or more) test data sets and observe code paths

SLIDE 41

Control Flow: Matrix Multiply

    void mm5(A, B, C)
    FLOATTYPE A[SZ][SZ], B[SZ][SZ], C[SZ][SZ];
    {
        int i, j, k;
        FLOATTYPE r;
        for (k = 0; k < SZ; k++) {
            for (i = 0; i < SZ; i++) {
                for (j = 0; j < SZ; j++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }

SLIDE 42

Data Flow: Matrix Multiply

[Figure: the corresponding dataflow graph for the mm5 loop nest.]
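The graph itself is not reproducible in text, but the point it makes can be annotated on the code; a plain-Java restatement of mm5 (illustrative, not MaxCompiler output) with the dependence structure the dataflow view exposes spelled out in comments:

    public class MatMulDataflow {
        static final int SZ = 4;

        // Control-flow view: the counters k, i, j impose an order.
        // Dataflow view: each C[i][j] is the sum of SZ products A[i][k] * B[k][j];
        // within one k-slice all SZ*SZ multiply-accumulates are independent,
        // so they can run as parallel streaming datapaths.
        static double[][] mm(double[][] a, double[][] b) {
            double[][] c = new double[SZ][SZ];
            for (int k = 0; k < SZ; k++)
                for (int i = 0; i < SZ; i++)
                    for (int j = 0; j < SZ; j++)
                        c[i][j] += a[i][k] * b[k][j];
            return c;
        }

        public static void main(String[] args) {
            double[][] id = new double[SZ][SZ];
            for (int i = 0; i < SZ; i++) id[i][i] = 1.0;
            System.out.println(mm(id, id)[0][0]); // 1.0: identity times identity
        }
    }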

SLIDE 43

Performance and Profiling of Accelerated Systems

SLIDE 44

Measuring Utilization

  • top measures the % of time the CPU is running
  • maxtop monitors the % of time the DFE is running

    MaxTop Tool 2011.2
    Found 2 Maxeler card(s) running MaxelerOS 2011.2
    Card 0: MAX3A (P/N: 13424) S/N: 219270088 Mem: 24GB DFE(s): 1 /dev/maxeler0
    Card 1: MAX3A (P/N: 13424) S/N: 000025559 Mem: 24GB DFE(s): 1 /dev/maxeler1
    DEVICE    %DFE   TEMP   BITSTREAM  PID    USER      TIME      COMMAND
    maxeler0  66.6%  57.1C  9d9de1...  12333  jspooner  00:00:39  model
    maxeler1   0.0%  54.6C  9d9de1...  -      -         -         -

SLIDE 45

Overlapping CPU + DFE

  • CPU and DFE can (and should!) process in parallel
    – Runtime is always limited by the longest-running part

[Diagram: sequential run-time stacks the CPU and DFE phases end to end; overlapped run-time runs them concurrently, shrinking the total.]
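A minimal sketch of the overlap, assuming a host-side queue to a single accelerator (plain Java; a real application would use the accelerator runtime's asynchronous interface rather than a thread pool): while batch b runs on the "DFE", the CPU prepares batch b + 1, so total time approaches max(CPU time, DFE time) instead of their sum.

    import java.util.concurrent.*;

    public class OverlapSketch {
        // Illustrative stand-ins for the real CPU and DFE work.
        static void prepareOnCpu(int b) { /* CPU pre/post-processing for batch b */ }
        static void runOnDfe(int b)     { /* DFE compute for batch b */ }

        public static void main(String[] args) throws Exception {
            final int BATCHES = 8;
            ExecutorService dfe = Executors.newSingleThreadExecutor(); // models the DFE
            prepareOnCpu(0);
            Future<?> inFlight = null;
            for (int b = 0; b < BATCHES; b++) {
                if (inFlight != null) inFlight.get();         // wait for previous batch
                final int batch = b;
                inFlight = dfe.submit(() -> runOnDfe(batch)); // launch batch b
                if (b + 1 < BATCHES) prepareOnCpu(b + 1);     // CPU works in parallel
            }
            inFlight.get();
            dfe.shutdown();
        }
    }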

SLIDE 46

Performance Profiling

SLIDE 47

Maxeler University Program Members

SLIDE 48

Conclusions

  • The challenge is to make the best use of the silicon we can
  • Frequency scaling is over; it's time to start thinking in parallel
  • Heterogeneous system design allows us to tailor systems to the applications
  • Ultra-fine-grained parallelism in dataflow computing benefits throughput and latency