Optimising SpMV for FEM on FPGAs
Paul Grigoras, Pavel Burovskiy, Wayne Luk, Spencer Sherwin
Finite Element Methods
○ Solve PDEs over large, unstructured geometries
○ PDEs: incompressible Navier-Stokes, shallow water, etc.
FEM simulation pipeline (Source: www.nektar.info):
Mesh over unstructured domain → Mesh elements → Sparse Matrix Assembly → PDE Solver → Iterative Linear Solver ⇒ SpMV → CFD Simulation

Accelerated stages:
○ Vector Gather/Scatter [Burovskiy FPL15]
○ Block Diagonal SpMV (this work)
○ Block diagonal SpMV (this work) vs generic SpMV (prior work): each block maps to one element ⇒ structured computation

Contributions:
○ Optimised architecture and implementation for block diagonal SpMV
○ Resource constrained performance model for the proposed architecture
○ Automated method to customise the architecture based on mesh parameters
○ Maximise throughput/area ⇒ fit larger meshes & improve performance
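A minimal software reference for the block diagonal SpMV kernel, assuming each diagonal block is stored densely and corresponds to one mesh element (block sizes may vary between elements); function and variable names here are illustrative, not taken from the CASK implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Block diagonal SpMV: y = A * x, where A consists of dense square
// blocks along the diagonal, one block per mesh element.
// blocks[e] holds the e-th block in row-major order; sizes[e] is its dimension.
std::vector<double> block_diag_spmv(const std::vector<std::vector<double>>& blocks,
                                    const std::vector<std::size_t>& sizes,
                                    const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    std::size_t offset = 0;  // start of the current block in x and y
    for (std::size_t e = 0; e < blocks.size(); ++e) {
        const std::size_t n = sizes[e];
        for (std::size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (std::size_t j = 0; j < n; ++j)
                acc += blocks[e][i * n + j] * x[offset + j];
            y[offset + i] = acc;
        }
        offset += n;
    }
    return y;
}
```

Because the blocks are dense and contiguous, no per-entry column index is needed: this is the reduced-metadata and simplified-control advantage over generic (e.g. CSR-based) SpMV that motivates the specialised architecture.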
Architecture:
○ Independent memory channel
○ Customisable precision datapath
○ Variable depth FIFO - supports block variations at runtime
○ Parametric: NMPEs, MPE width
○ Task vs data parallelism tradeoff ⇒ mesh-specific optimal configuration

Exploiting the block diagonal structure:
⇒ Simplified control (format decoding)
⇒ Reduced metadata
⇒ Simplified reduction circuit
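A hedged sketch of what a resource constrained search over the two architecture parameters (NMPEs for task parallelism, MPE width for data parallelism) might look like; the DSP/BRAM cost formulas and all constants below are placeholder assumptions for illustration, not the paper's actual model:

```cpp
#include <algorithm>

struct Config { int nMPEs; int mpeWidth; };

// Illustrative resource-constrained search over (NMPEs, MPE width).
// Throughput is capped by both compute (nMPEs * mpeWidth values/cycle)
// and available DRAM bandwidth; each MPE lane costs DSPs and BRAMs.
Config best_config(int dspBudget, int bramBudget, double maxValuesPerCycle) {
    Config best{0, 0};
    double bestThroughput = 0.0;
    for (int n = 1; n <= 16; ++n) {          // task parallelism: number of MPEs
        for (int w = 1; w <= 16; w *= 2) {   // data parallelism: MPE width
            int dsps = n * w;                // one DSP per multiply lane (assumed)
            int brams = n * (8 + w);         // FIFOs + per-lane buffering (assumed)
            if (dsps > dspBudget || brams > bramBudget) continue;
            double throughput = std::min(double(n * w), maxValuesPerCycle);
            if (throughput > bestThroughput) {
                bestThroughput = throughput;
                best = {n, w};
            }
        }
    }
    return best;
}
```

Once memory bandwidth saturates, adding lanes no longer helps, so the search settles on the smallest resource footprint that reaches the bandwidth cap — the mesh-specific optimum the slides refer to.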
[Figure: number of blocks for each element, across candidate architectures]
Design goals:
○ Integration in existing FEM software packages
○ Efficient access pattern in DRAM
○ Maximise throughput
○ Minimise control logic
Offline tuning: build a repository of customised architectures from a set of mesh instances
Runtime: select the architecture matching the input mesh instance
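The runtime step above can be sketched as a nearest-match lookup in the offline-built repository; the mesh parameters used and the distance metric (block size dominating element count) are illustrative assumptions, not the tool's actual selection criterion:

```cpp
#include <climits>
#include <cstdlib>
#include <vector>

struct MeshParams { int blockSize; int numElements; };
struct PrebuiltDesign { MeshParams tunedFor; const char* bitstream; };

// Runtime step of the tuning flow: pick, from the repository built offline,
// the design whose tuned mesh parameters are closest to the input mesh.
const PrebuiltDesign* select_design(const std::vector<PrebuiltDesign>& repo,
                                    const MeshParams& mesh) {
    const PrebuiltDesign* best = nullptr;
    long bestDist = LONG_MAX;
    for (const auto& d : repo) {
        // Block size mismatch is weighted heavily: it changes the optimal
        // MPE width, whereas element count mainly changes runtime.
        long dist = 1000L * std::abs(d.tunedFor.blockSize - mesh.blockSize)
                  + std::abs(d.tunedFor.numElements - mesh.numElements);
        if (dist < bestDist) { bestDist = dist; best = &d; }
    }
    return best;
}
```

Selecting from prebuilt bitstreams keeps place-and-route off the critical path: the expensive customisation happens offline, once per representative mesh.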
○ Design: MaxCompiler + MaxJ dataflow language
○ FPGA Server: Maxeler MAX4 Maia (Stratix VSG, 48GB DRAM per board)
○ Software: C++14, G++ 5.2
○ CPU Server: Dual Intel Xeon E5-2640, 64GB DRAM, Infiniband QSFP
○ Place and route with Altera Quartus 14.1
○ Available as an extension to the CASK framework [Grigoras et al, FPGA 16]: http://caskorg.github.io/cask/
○ [Burovskiy et al, FPL 15], Nektar++ Accelerated FEM
a. Fixed mesh (NACA 1L, [Burovskiy et al, FPL 2015]) - optimal architecture
○ Compute efficiency is maximised for smaller MPE width
○ Achieved DRAM bandwidth is maximised for larger MPE width
⇒ Aggressive tuning (max MPE width) is not resource efficient
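The compute-efficiency side of this tradeoff can be made concrete with a simple padding model, assuming each row of an n x n block is consumed w values per cycle (MPE width w); this model is an illustration of the effect, not the paper's measured efficiency:

```cpp
// Fraction of multiply lanes doing useful work when rows of an n x n
// block are processed w values at a time: the final pass over a row is
// padded with zeros whenever w does not divide n.
double compute_efficiency(int n, int w) {
    int passes = (n + w - 1) / w;        // cycles spent per block row
    return double(n) / (passes * w);     // useful MACs / issued MACs
}
```

For example, 6-wide rows on 2 lanes keep every lane busy, while 4 or 8 lanes waste a quarter of the issued operations on padding — which is why a narrower MPE is more compute-efficient even though a wider one draws more DRAM bandwidth.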
b. Fixed architecture, variable mesh: data parallel vs task parallel
○ A1: +Mem. channels, -Vector lanes, 1058 BRAMs ⇒ good for small blocks (~2X better)
○ A2: -Mem. channels, +Vector lanes, 686 BRAMs ⇒ good for large blocks (~2X better)
a. Baseline: Nektar++ implementation from [Burovskiy et al, FPL 2015]