Optimising SpMV for FEM on FPGAs
Paul Grigoras, Pavel Burovskiy, Wayne Luk, Spencer Sherwin
Finite Element Methods
○ Solve PDEs over large, unstructured geometries
○ PDEs: incompressible Navier-Stokes, shallow water, etc.
FEM simulation pipeline (Source: www.nektar.info):
Mesh over unstructured domain → Mesh elements → Sparse Matrix Assembly → PDE Solver → Iterative Linear Solver ⇒ SpMV → CFD Simulation

Accelerated stages:
○ Vector Gather/Scatter [Burovskiy FPL15]
○ Block Diagonal SpMV (this work)
○ Block diagonal SpMV (this work) vs generic SpMV (prior work): each block maps to one element ⇒ structured computation

Contributions:
○ Optimised architecture and implementation for block diagonal SpMV
○ Resource constrained performance model for the proposed architecture
○ Automated method to customise the architecture based on mesh parameters
○ Maximise throughput/area ⇒ fit larger meshes & improve performance
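A minimal software reference for the block diagonal SpMV kernel, assuming each diagonal block is stored densely and corresponds to one mesh element (block sizes may vary between elements); function and variable names here are illustrative, not taken from the CASK implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Block diagonal SpMV: y = A * x, where A consists of dense square
// blocks along the diagonal, one block per mesh element.
// blocks[e] holds the e-th block in row-major order; sizes[e] is its dimension.
std::vector<double> block_diag_spmv(const std::vector<std::vector<double>>& blocks,
                                    const std::vector<std::size_t>& sizes,
                                    const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    std::size_t offset = 0;  // start of the current block in x and y
    for (std::size_t e = 0; e < blocks.size(); ++e) {
        const std::size_t n = sizes[e];
        for (std::size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (std::size_t j = 0; j < n; ++j)
                acc += blocks[e][i * n + j] * x[offset + j];
            y[offset + i] = acc;
        }
        offset += n;
    }
    return y;
}
```

Because the blocks are dense and contiguous, no per-entry column index is needed: this is the reduced-metadata and simplified-control advantage over generic (e.g. CSR-based) SpMV that motivates the specialised architecture.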
Architecture:
○ Independent memory channel
○ Customisable precision datapath
○ Variable depth FIFO - supports block variations at runtime
○ Parametric: NMPEs, MPE width
○ Task vs data parallelism tradeoff ⇒ mesh-specific optimal configuration

Exploiting the block diagonal structure:
⇒ Simplified control (format decoding)
⇒ Reduced metadata
⇒ Simplified reduction circuit
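A hedged sketch of what a resource constrained search over the two architecture parameters (NMPEs for task parallelism, MPE width for data parallelism) might look like; the DSP/BRAM cost formulas and all constants below are placeholder assumptions for illustration, not the paper's actual model:

```cpp
#include <algorithm>

struct Config { int nMPEs; int mpeWidth; };

// Illustrative resource-constrained search over (NMPEs, MPE width).
// Throughput is capped by both compute (nMPEs * mpeWidth values/cycle)
// and available DRAM bandwidth; each MPE lane costs DSPs and BRAMs.
Config best_config(int dspBudget, int bramBudget, double maxValuesPerCycle) {
    Config best{0, 0};
    double bestThroughput = 0.0;
    for (int n = 1; n <= 16; ++n) {          // task parallelism: number of MPEs
        for (int w = 1; w <= 16; w *= 2) {   // data parallelism: MPE width
            int dsps = n * w;                // one DSP per multiply lane (assumed)
            int brams = n * (8 + w);         // FIFOs + per-lane buffering (assumed)
            if (dsps > dspBudget || brams > bramBudget) continue;
            double throughput = std::min(double(n * w), maxValuesPerCycle);
            if (throughput > bestThroughput) {
                bestThroughput = throughput;
                best = {n, w};
            }
        }
    }
    return best;
}
```

Once memory bandwidth saturates, adding lanes no longer helps, so the search settles on the smallest resource footprint that reaches the bandwidth cap — the mesh-specific optimum the slides refer to.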
[Figure: number of blocks for each element, across candidate architectures]
Design goals:
○ Integration in existing FEM software packages
○ Efficient access pattern in DRAM
○ Maximise throughput
○ Minimise control logic
Offline tuning: build a repository of customised architectures from a set of mesh instances
Runtime: select the architecture matching the input mesh instance
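The runtime step above can be sketched as a nearest-match lookup in the offline-built repository; the mesh parameters used and the distance metric (block size dominating element count) are illustrative assumptions, not the tool's actual selection criterion:

```cpp
#include <climits>
#include <cstdlib>
#include <vector>

struct MeshParams { int blockSize; int numElements; };
struct PrebuiltDesign { MeshParams tunedFor; const char* bitstream; };

// Runtime step of the tuning flow: pick, from the repository built offline,
// the design whose tuned mesh parameters are closest to the input mesh.
const PrebuiltDesign* select_design(const std::vector<PrebuiltDesign>& repo,
                                    const MeshParams& mesh) {
    const PrebuiltDesign* best = nullptr;
    long bestDist = LONG_MAX;
    for (const auto& d : repo) {
        // Block size mismatch is weighted heavily: it changes the optimal
        // MPE width, whereas element count mainly changes runtime.
        long dist = 1000L * std::abs(d.tunedFor.blockSize - mesh.blockSize)
                  + std::abs(d.tunedFor.numElements - mesh.numElements);
        if (dist < bestDist) { bestDist = dist; best = &d; }
    }
    return best;
}
```

Selecting from prebuilt bitstreams keeps place-and-route off the critical path: the expensive customisation happens offline, once per representative mesh.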
○ Design: MaxCompiler + MaxJ dataflow language
○ FPGA Server: Maxeler MAX4 Maia (Stratix VSG, 48GB DRAM per board)
○ Software: C++14, G++ 5.2
○ CPU Server: Dual Intel Xeon E5-2640, 64GB DRAM, Infiniband QSFP
○ Place and route with Altera Quartus 14.1
○ Available as an extension to the CASK framework [Grigoras et al, FPGA 16]: http://caskorg.github.io/cask/
○ [Burovskiy et al, FPL 15], Nektar++ Accelerated FEM
a. Fixed mesh (NACA 1L, [Burovskiy et al, FPL 2015]) - optimal architecture
○ Compute efficiency is maximised for smaller MPE width
○ Achieved DRAM bandwidth is maximised for larger MPE width
⇒ Aggressive tuning (max MPE width) is not resource efficient
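The compute-efficiency side of this tradeoff can be made concrete with a simple padding model, assuming each row of an n x n block is consumed w values per cycle (MPE width w); this model is an illustration of the effect, not the paper's measured efficiency:

```cpp
// Fraction of multiply lanes doing useful work when rows of an n x n
// block are processed w values at a time: the final pass over a row is
// padded with zeros whenever w does not divide n.
double compute_efficiency(int n, int w) {
    int passes = (n + w - 1) / w;        // cycles spent per block row
    return double(n) / (passes * w);     // useful MACs / issued MACs
}
```

For example, 6-wide rows on 2 lanes keep every lane busy, while 4 or 8 lanes waste a quarter of the issued operations on padding — which is why a narrower MPE is more compute-efficient even though a wider one draws more DRAM bandwidth.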
b. Fixed architecture, variable mesh: data parallel vs task parallel
○ A1: +Mem. channels, -Vector lanes, 1058 BRAMs ⇒ good for small blocks (~2X better)
○ A2: -Mem. channels, +Vector lanes, 686 BRAMs ⇒ good for large blocks (~2X better)
a. Baseline: Nektar++ implementation from [Burovskiy et al, FPL 2015]