Acceleration in the Wild, with Data Flow Computing
James Spooner, VP of Acceleration
QCon Finance Track, 08 March 2012
Acceleration in the Wild with Data Flow
- Deliberate, focused approach to improving application speed
  – Involves adding Data Flow Engines (DFEs)
  – Makes some of the program faster
  – Will be programmed intentionally and be architecture-specific
  – Will exploit as much available parallelism as possible
  – May require transformations to expose parallelism
  – May have multiple implementations
Maxeler is an acceleration specialist, delivering end-to-end performance for a range of clients in the banking and oil and gas exploration industries.
Making efficient use of Silicon
- J. P. Eckert, Jr. (co-inventor of ENIAC)
Credit: Prof. Paul H.J. Kelly
Computing History…
“The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.”
- Daniel Slotnick (Chief Architect of ILLIAC IV), 1967
Computing History…
Credit: Prof. Michael J. Flynn
- Eckert (and Amdahl) were right, Slotnick was wrong, until…
- Serial computing hit the wall(s) last decade:
  – The memory wall: the increasing gap between processor and memory speeds. This effect pushes cache sizes larger in order to mask the latency of memory, which helps only to the extent that memory bandwidth is not the bottleneck in performance.
  – The ILP wall: the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.
  – The power wall: the trend of consuming exponentially increasing power with each increase of operating frequency. This can be mitigated by "shrinking" the processor, using smaller traces for the same logic, but the power wall poses manufacturing, system-design and deployment problems that are not justified given the diminished performance gains due to the memory wall and ILP wall.
So what happened?
Source: Wikipedia
The dynamic power equation behind the power wall:

P = C · V_DD² · f · load_avg

(switched capacitance C, supply voltage V_DD, clock frequency f, average switching activity load_avg)
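Why frequency scaling stopped paying, in one step (an editorial gloss on the formula above, not from the original slides): supply voltage must rise roughly in proportion to frequency to keep transistors switching reliably, so dynamic power grows roughly as

P ∝ f · V_DD² ∝ f³

A 2x clock increase therefore costs on the order of 8x the power, while a second core at the same clock costs roughly 2x for the same nominal throughput.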
Using silicon efficiently - parallelism
Level of Parallelism | Examples | Costs

- Coarse grained
  – Examples: multi-node, multi-chip, multi-core; process/thread-level parallelism (sketched in code after this table)
  – Costs: developing a distributed system; locks, mutexes, queues, etc.
- Fine grained
  – Examples: instruction-level parallelism (ILP): out-of-order execution, superscalar, instruction pipelining, speculative execution; data-level parallelism: SIMD/SSE
  – Costs: lots of silicon; the compiler can do some work upfront
- Ultra fine grained
  – Examples: dataflow architectures: massively parallel, lock-free, hazard-free streaming datapaths
  – Costs: resolved once, up front
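A hedged illustration, not from the original deck, using the running example that appears later in the talk (y[i] = x[i] * x[i] + 30): the coarse-grained row of the table expressed in plain Java. The fine-grained version is what the CPU's superscalar and SIMD hardware does with the same loop; the ultra-fine-grained version is the dataflow kernel shown later.

import java.util.stream.IntStream;

public class CoarseGrained {
    public static void main(String[] args) {
        int n = 1 << 20;
        int[] x = new int[n];
        int[] y = new int[n];
        for (int i = 0; i < n; i++) x[i] = i;

        // Coarse-grained, thread-level parallelism: the runtime splits the
        // index range across a pool of worker threads, at the cost of the
        // coordination machinery listed in the table above.
        IntStream.range(0, n).parallel().forEach(i -> y[i] = x[i] * x[i] + 30);
    }
}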
How is modern silicon used?
Intel 6-Core X5680 “Westmere”
[Annotated die photo, distinguishing the computation units from the support logic for fine-grained parallelism]
What is Dataflow Computing?
Computing with control-flow processors vs. computing with dataflow engines (DFEs)
MPC-X1000
- 8 vectis dataflow engines (DFEs)
- 192GB of DFE RAM
- Dynamic allocation of DFEs to conventional CPU servers
  – Zero-copy RDMA between CPUs and DFEs over Infiniband
- Equivalent performance to 40-60 x86 servers
1U dataflow cloud providing dynamically scalable compute capability over Infiniband
Dataflow Programming
Application Components
[Diagram: a host application on the CPU, with main memory, talking through SLiC and MaxelerOS over PCI Express to the DFE, where a Manager connects the kernels (the dataflow graph of multipliers and adders) to dataflow memory]
Programming with MaxCompiler
[Diagram: CPU code in C/C++/Fortran, linked against SLiC and running against main memory, drives a DFE whose kernels and manager are written in MaxJ]

The loop to accelerate, in the CPU code (.c):

for (int i = 0; i < DATA_SIZE; i++)
    y[i] = x[i] * x[i] + 30;
MaxCompiler Development Process
CPU code (.c):

int *x, *y;
for (int i = 0; i < DATA_SIZE; i++)
    y[i] = x[i] * x[i] + 30;

[Diagram: main memory and the CPU; the loop body (x × x, + 30 → y) is the computation to move onto the chip, reached over PCI Express under the control of a Manager]
MaxCompiler Development Process
Kernel (MyKernel.java):

HWVar x = io.input("x", hwInt(32));
HWVar result = x * x + 30;
io.output("y", result, hwInt(32));

Manager (.java):

Manager m = new Manager("Calc");
Kernel k = new MyKernel();
m.setKernel(k);
m.setIO(link("x", PCIE),
        link("y", PCIE));
m.addMode(modeDefault());
m.build();

CPU code (.c):

#include "MaxSLiCInterface.h"
#include "Calc.max"
int *x, *y;
Calc(x, y, DATA_SIZE);

[Diagram: SLiC and MaxelerOS stream x and y between main memory/CPU and the dataflow graph on the chip over PCI Express]
MaxCompiler Development Process
Kernel (MyKernel.java):

HWVar x = io.input("x", hwInt(32));
HWVar result = x * x + 30;
io.output("y", result, hwInt(32));

Manager (.java):

Manager m = new Manager();
Kernel k = new MyKernel();
m.setKernel(k);
m.setIO(link("x", PCIE),
        link("y", DRAM_LINEAR1D));
m.addMode(modeDefault());
m.build();

Host code (.c):

#include "MaxSLiCInterface.h"
#include "Calc.max"
int *x, *y;
device = max_open_device(maxfile, "/dev/maxeler0");
Calc(x, DATA_SIZE);

[Diagram: the output stream y now goes into the DFE's on-board DRAM rather than returning over PCI Express]
The Full Kernel

public class MyKernel extends Kernel {
    public MyKernel(KernelParameters parameters) {
        super(parameters);
        HWVar x = io.input("x", hwInt(32));
        HWVar result = x * x + 30;
        io.output("y", result, hwInt(32));
    }
}
Kernel Streaming: In Hardware

[Animation, over several slides: the dataflow graph (x × x, + 30 → y) with the input stream 0, 1, 2, 3, 4, 5 flowing through it. On every tick one element enters the pipeline and, once the pipeline has filled, one result leaves it, producing the output stream 30, 31, 34, 39, 46, 55; modelled in code below]
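A tiny software model of that animation (plain Java, not from the deck, and ignoring the pipeline-fill latency the slides show): one element is consumed and one result produced per tick.

import java.util.Arrays;

public class StreamModel {
    public static void main(String[] args) {
        int[] in = {0, 1, 2, 3, 4, 5};
        int[] out = new int[in.length];
        // One tick per element: x enters, x*x + 30 leaves.
        for (int tick = 0; tick < in.length; tick++) {
            out[tick] = in[tick] * in[tick] + 30;
        }
        System.out.println(Arrays.toString(out)); // [30, 31, 34, 39, 46, 55]
    }
}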
Data flow graph as generated by MaxCompiler: 4866 nodes; about 250 x 100
How we approach Acceleration
What always makes Acceleration hard?

- Messy code
- Complicated build dependencies
- Confused control flow
- Impenetrable data access
- Pointer-intensive data structures
- Premature optimization
Conflicting Goals

[Diagram: an array of pointers to point objects; most hold Cartesian fields (x, y, z), one holds polar fields (r, θ), and two pointers, p and q, share the same point]

for (i = 0; i < N; ++i) {
    points[i]->incx();
}

- Some well-motivated software structures have real value, but make acceleration harder
- Examples (sketched in code below):
  – Virtual method calls inside a loop
  – Collections with non-uniform type
  – Substructure sharing
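A hedged Java sketch of those three patterns (illustrative names, not from the deck): the loop looks simple, but every iteration dispatches a virtual call on a collection whose element types and aliasing are only known at run time.

interface Point { void incx(); }

class CartesianPoint implements Point {
    double x, y, z;
    public void incx() { x += 1.0; }
}

class PolarPoint implements Point {
    double r, theta;
    // Non-uniform type: incrementing x means converting coordinates.
    public void incx() {
        double x = r * Math.cos(theta) + 1.0;
        double y = r * Math.sin(theta);
        r = Math.hypot(x, y);
        theta = Math.atan2(y, x);
    }
}

class HardToAccelerate {
    public static void main(String[] args) {
        Point shared = new CartesianPoint();
        // Substructure sharing: two slots alias the same object.
        Point[] points = { shared, new PolarPoint(), shared };
        for (int i = 0; i < points.length; i++) {
            points[i].incx(); // virtual dispatch: target unknown until run time
        }
    }
}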
What makes Acceleration easier?

for (i = 0; i < N; ++i) {
    points[i]->incx();
}

- Self-evident data dependences
- Computing on large collections of uniform data
- Appropriate representation hiding
- Getting the abstraction right

[Diagram: the same points stored as a structure of arrays (x x x x x x x x | y y y y y y y y | z z z z z z z z), sketched in code below]
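A hedged sketch of that restructuring (illustrative names, not from the deck): the collection becomes uniform arrays, the loop body becomes plain arithmetic, and the data dependences are self-evident, which is exactly the shape a dataflow compiler can stream.

class EasyToAccelerate {
    public static void main(String[] args) {
        int n = 8;
        // Structure of arrays: uniform, contiguous, no pointers, no aliasing.
        double[] xs = new double[n];
        double[] ys = new double[n];
        double[] zs = new double[n];

        // incx over the whole collection: each iteration is independent.
        for (int i = 0; i < n; i++) {
            xs[i] += 1.0;
        }
    }
}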
Maximum Performance Computing
- Identify parallelism and take advantage of it
  – Fully understand data dependencies
- Minimize memory bandwidth
  – Data reuse and representation (see the kernel sketch after this list)
- Regularize the computation and data
  – Minimize control-flow complexity
- Find the optimal balance for the underlying architecture
  – Memory hierarchy bandwidths, sizes and latencies
  – Communication bandwidths and latencies
  – Math performance
  – Branch cost (control divergence)
  – Axes of parallelism
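One hedged example of data reuse in this style, not from the deck: a 3-point moving average as a MaxJ kernel, assuming MaxCompiler's stream.offset facility for referencing neighbouring stream elements, so each input is fetched from memory once but used three times.

public class MovingAverageKernel extends Kernel {
    public MovingAverageKernel(KernelParameters parameters) {
        super(parameters);
        HWVar x = io.input("x", hwFloat(8, 24));
        // Neighbouring stream elements: no extra memory traffic, the values
        // are held in on-chip registers as the stream flows past.
        HWVar prev = stream.offset(x, -1);
        HWVar next = stream.offset(x, 1);
        HWVar result = (prev + x + next) / 3;
        io.output("y", result, hwFloat(8, 24));
    }
}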
Maxeler Acceleration Process

- Run the code with profiling tools
- Understand data and loop structures and data access patterns
- Investigate transformation options for these structures and access patterns
- Decide which parts of the code need acceleration
- Implement and validate
[Process flow: Analysis → Code Transformation → Partitioning → Implementation → Result. Analysis sets the theoretical performance bounds; implementation achieves the performance]
Application Analysis
Partitioning Options
[Chart: candidate data access plans, code partitionings and transformations plotted as runtime vs. development time; the Pareto-optimal options (those where no alternative is both faster and quicker to develop) form the frontier]

Try to minimise runtime and development time, while maximising flexibility and precision.
Credit Derivatives Valuation & Risk
- Compute the value of complex financial derivatives (CDOs)
- Typically run overnight, but beneficial to compute in real time
- Many independent jobs
- Speedup: 220-270x
- Power consumption per node drops from 250W to 235W (see the note below)
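A hedged implication of those numbers, not spelled out in the deck: energy per job is power × time, so at a 220x speedup the accelerated node draws 235W for 1/220th of the time, cutting energy per valuation by roughly 220 × (250 / 235) ≈ 234x.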
Discovering the Dataflow of an Application
MaxSpot

- Developed in-house to make deciphering complex code easier
- MaxSpot is a tool to profile, analyse, and visualise the dynamic behaviour of applications
- Extensible analysis framework
- Determines control flow and data flow
- Builds loop graphs
- Runs on application binaries
  – Independent of the original programming language(s)
  – Execute MaxSpot with one (or more) test data-sets and observe code paths
Control Flow: Matrix Multiply
void mm5(A, B, C)
FLOATTYPE A[SZ][SZ], B[SZ][SZ], C[SZ][SZ];
{
    int i, j, k;
    FLOATTYPE r;
    for (k = 0; k < SZ; k++) {
        for (i = 0; i < SZ; i++) {
            for (j = 0; j < SZ; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
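The point of the contrast with the next slide, stated briefly: every element of A and B in mm5 is a pure input, and the only loop-carried dependence is the accumulation into C[i][j] across the k loop. The control-flow view buries that structure inside three nested loops; the dataflow view exposes it directly as a graph of multipliers feeding accumulators.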
Data Flow: Matrix Multiply
Performance and Profiling of Accelerated Systems
- top measures the % of time the CPU is running
- maxtop monitors the % of time the DFE is running
Measuring Utilization
MaxTop Tool 2011.2
Found 2 Maxeler card(s) running MaxelerOS 2011.2
Card 0: MAX3A (P/N: 13424) S/N: 219270088 Mem: 24GB DFE(s): 1 /dev/maxeler0
Card 1: MAX3A (P/N: 13424) S/N: 000025559 Mem: 24GB DFE(s): 1 /dev/maxeler1
DEVICE    %DFE   TEMP   BITSTREAM  PID    USER      TIME      COMMAND
maxeler0  66.6%  57.1C  9d9de1...  12333  jspooner  00:00:39  model
maxeler1   0.0%  54.6C  9d9de1...  -      -         -         -
- CPU and DFE can (and should!) process in parallel
  – Runtime is always limited by the longest-running part
Overlapping CPU + DFE
[Timeline diagram: sequential run-time, with the CPU and DFE taking turns, vs. overlapped run-time, with the CPU and DFE working in parallel; sketched in code below]
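A hedged sketch of the overlap in plain Java (runOnDfe and runOnCpu are hypothetical stand-ins for a non-blocking DFE call and the CPU-side work; the deck does not show the asynchronous SLiC interface): submit the DFE work, keep the CPU busy, then join.

import java.util.concurrent.*;

public class Overlap {
    // Hypothetical stand-ins for the DFE call and the CPU-side work.
    static int[] runOnDfe(int[] chunk) { return chunk; }
    static int[] runOnCpu(int[] chunk) { return chunk; }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        int[] dfeChunk = new int[1024], cpuChunk = new int[1024];

        // Kick off the DFE part without blocking ...
        Future<int[]> dfeResult = pool.submit(() -> runOnDfe(dfeChunk));
        // ... so the CPU part runs concurrently.
        int[] cpuResult = runOnCpu(cpuChunk);

        // Total runtime is max(CPU part, DFE part), not their sum.
        int[] fromDfe = dfeResult.get();
        pool.shutdown();
    }
}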
Performance Profiling
Maxeler University Program Members
- The challenge is to make the best use of the silicon we can
- Frequency scaling is over; it's time to start thinking in parallel
- Heterogeneous system design allows us to tailor systems to the applications
- Ultra-fine-grained parallelism in dataflow computing benefits throughput and latency