Optimising SpMV for FEM on FPGAs


slide-1
SLIDE 1

Optimising SpMV for FEM on FPGAs

Paul Grigoras, Pavel Burovskiy, Wayne Luk, Spencer Sherwin


slide-2
SLIDE 2


slide-3
SLIDE 3

Finite Element Methods - Solve PDEs over large, unstructured geometries

  • PDEs: incompressible Navier-Stokes, shallow water, etc.
  • Applications: computational fluid dynamics, biomedicine, geoscience, etc.

slide-4
SLIDE 4

Finite Element Methods

  • Mesh over unstructured domain

Source: www.nektar.info

slide-5
SLIDE 5

Finite Element Methods

  • Mesh over unstructured domain
  • Mesh elements

Source: www.nektar.info

slide-6
SLIDE 6

Finite Element Methods

  • Mesh over unstructured domain
  • Mesh elements
  • Sparse Matrix Assembly

Source: www.nektar.info

slide-7
SLIDE 7

Finite Element Methods

  • Mesh over unstructured domain
  • Mesh elements
  • Sparse Matrix Assembly
  • PDE Solver

Source: www.nektar.info

slide-8
SLIDE 8

Finite Element Methods

  • Mesh over unstructured domain
  • CFD Simulation

Source: www.nektar.info

slide-9
SLIDE 9

Finite Element Methods

  • Mesh over unstructured domain
  • Mesh elements
  • Sparse Matrix Assembly
  • PDE Solver

Source: www.nektar.info

slide-10
SLIDE 10

Finite Element Methods

  • Mesh over unstructured domain
  • Mesh elements
  • Sparse Matrix Assembly
  • PDE Solver
  • Linear Solver

Source: www.nektar.info

slide-11
SLIDE 11

Finite Element Methods

  • Mesh over unstructured domain
  • Mesh elements
  • Sparse Matrix Assembly
  • PDE Solver
  • Iterative Linear Solver ⇒ SpMV

Source: www.nektar.info

slide-12
SLIDE 12

Finite Element Methods

  • Mesh over unstructured domain
  • Mesh elements
  • Sparse Matrix Assembly
  • PDE Solver
  • Vector Gather/Scatter [Burovskiy FPL15]
  • Block Diagonal SpMV (this work)

Source: www.nektar.info

slide-13
SLIDE 13

Overview

  • Point of departure: focus on high-order spectral/hp FEM with local assembly

○ block diagonal SpMV (this work) vs generic SpMV (prior work)

slide-14
SLIDE 14

Block SpMV

  • Each dense block corresponds to one element
  • Larger dense blocks ⇒ more structured computation
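As an illustration of the block diagonal structure, here is a minimal software sketch of the operation, not the FPGA implementation. The function name `block_diag_spmv` and the list-of-rows block layout are assumptions made for this example. Each dense block multiplies only its own slice of the input vector, so no column indices are needed.

```python
def block_diag_spmv(blocks, x):
    """Multiply a block diagonal matrix by a dense vector.

    blocks: list of square dense blocks, each given as a list of rows.
    x: dense vector whose length equals the sum of the block sizes.
    """
    if sum(len(block) for block in blocks) != len(x):
        raise ValueError("vector length does not match total block size")
    y = []
    offset = 0
    for block in blocks:
        n = len(block)
        x_local = x[offset:offset + n]  # slice of x owned by this block
        for row in block:
            # Dense dot product of one block row with the local slice.
            y.append(sum(a * v for a, v in zip(row, x_local)))
        offset += n
    return y


# Hypothetical two-element example: one 2 x 2 block and one 3 x 3 block.
blocks = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]],
]
x = [1.0, 1.0, 1.0, 2.0, 3.0]
y = block_diag_spmv(blocks, x)
```

Each block is processed independently of the others, which is what makes the computation regular and easy to distribute.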

slide-15
SLIDE 15

Overview

  • Point of departure: focus on high-order spectral/hp FEM with local assembly

○ block diagonal SpMV (this work) vs generic SpMV (prior work)

slide-16
SLIDE 16

Overview

  • Point of departure: focus on high-order spectral/hp FEM with local assembly

○ block diagonal SpMV (this work) vs generic SpMV (prior work)

  • Contributions:

○ Optimised architecture and implementation for block diagonal SpMV
○ Resource-constrained performance model for the proposed architecture
○ Automated method to customise the architecture based on mesh parameters

slide-17
SLIDE 17

Overview

  • Point of departure: focus on high-order spectral/hp FEM with local assembly

○ block diagonal SpMV (this work) vs generic SpMV (prior work)

  • Contributions:

○ Optimised architecture and implementation for block diagonal SpMV
○ Resource-constrained performance model for the proposed architecture
○ Automated method to customise the architecture based on mesh parameters

  • Result: a custom, mesh-specific architecture generator

○ Maximise throughput/area ⇒ fit larger meshes & improve performance

slide-18
SLIDE 18

Architecture

  • Each MPE has

○ Independent memory channel
○ Customisable precision datapath
○ Variable depth FIFO - supports block variations at runtime

slide-19
SLIDE 19

Architecture

  • Each MPE has

○ Independent memory channel
○ Customisable precision datapath
○ Variable depth FIFO - supports block variations at runtime

  • Design:

○ Parametric: N_MPEs, MPE_width
○ Task vs data parallelism tradeoff
○ ⇒ Mesh-specific optimal configuration

slide-20
SLIDE 20

Architecture

  • Each MPE has

○ Independent memory channel
○ Customisable precision datapath
○ Variable depth FIFO - supports block variations at runtime

  • Design:

○ Parametric: N_MPEs, MPE_width
○ Task vs data parallelism tradeoff
○ ⇒ Mesh-specific optimal configuration

  • Block SpMV advantages:

⇒ Simplified control (format decoding)
⇒ Reduced metadata
⇒ Simplified reduction circuit
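To give a rough feel for the "reduced metadata" point, the sketch below counts index words for the same matrix stored in CSR and in a block diagonal format. It assumes a standard CSR layout (one column index per nonzero plus one row pointer per row, plus one) and a hypothetical block format that stores only one size per block. The actual on-chip format of this work is not described in the slides.

```python
def csr_index_words(block_sizes):
    """Index words for CSR: one column index per nonzero, plus row pointers."""
    nnz = sum(b * b for b in block_sizes)  # every block is fully dense
    rows = sum(block_sizes)
    return nnz + rows + 1


def block_diag_index_words(block_sizes):
    """Index words for the assumed block format: one size per block."""
    return len(block_sizes)


# Hypothetical mesh: 1000 elements, each producing a 64 x 64 dense block.
sizes = [64] * 1000
csr_words = csr_index_words(sizes)
block_words = block_diag_index_words(sizes)
```

Under these assumptions the CSR index data grows with the number of nonzeros, while the block format grows only with the number of elements.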

slide-21
SLIDE 21

Parameter Extraction

  • Assume matrix is block diagonal
  • Extract mesh parameters: size & number of blocks for each element
  • In design space exploration (DSE): find and synthesise optimal architectures
  • At runtime: select the appropriate architecture
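One way the extraction step could look in software is sketched below. This is an assumption for illustration, not the tool used in this work: `block_sizes_from_pattern` and `size_histogram` are hypothetical names, and the input is taken to be the nonzero positions of a matrix that is already block diagonal. Block boundaries are found by tracking how far each row couples along the diagonal.

```python
from collections import Counter


def block_sizes_from_pattern(n, entries):
    """Return diagonal block sizes of an n x n block diagonal pattern.

    entries: iterable of (row, col) positions of nonzeros.
    """
    reach = list(range(n))  # furthest index coupled to each row or column
    for r, c in entries:
        lo, hi = min(r, c), max(r, c)
        reach[lo] = max(reach[lo], hi)
    sizes = []
    start = 0
    end = 0
    for i in range(n):
        end = max(end, reach[i])
        if i == end:  # nothing in rows start..i couples past index i
            sizes.append(i - start + 1)
            start = i + 1
    return sizes


def size_histogram(sizes):
    """Map each block size to the number of blocks of that size."""
    return dict(sorted(Counter(sizes).items()))


# Hypothetical 5 x 5 pattern with blocks of size 2, 1 and 2.
entries = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (3, 3), (3, 4), (4, 3), (4, 4)]
sizes = block_sizes_from_pattern(5, entries)
histogram = size_histogram(sizes)
```

The histogram of block sizes is the kind of summary a design space exploration could consume when choosing architecture parameters.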

slide-22
SLIDE 22

Performance Model


  • Mesh parameters ⇒ optimal architecture parameters
  • Resource usage:
  • Performance:
  • Functional, hardware constraints ⇒ See paper for details
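
The model itself is in the paper and is not reproduced on this slide. Purely as an illustration of how mesh parameters could drive the choice of architecture parameters, the sketch below uses a simplified, assumed cost model: a b x b block takes b * ceil(b / width) cycles on one MPE, blocks are dealt round-robin to MPEs, and the configuration is limited by a total lane budget and a maximum number of MPEs (for example, memory channels). All names, the cost model and the limits are assumptions.

```python
from math import ceil


def estimate_cycles(block_sizes, n_mpes, width):
    """Estimated cycles with blocks dealt round-robin to n_mpes engines."""
    loads = [0] * n_mpes
    for i, b in enumerate(block_sizes):
        # Assumed cost: each of the b rows needs ceil(b / width) cycles.
        loads[i % n_mpes] += b * ceil(b / width)
    return max(loads)  # the busiest engine determines the runtime


def explore(block_sizes, lane_budget, max_mpes, widths=(1, 2, 4, 8, 16, 32)):
    """Return (cycles, n_mpes, width) with the fewest estimated cycles.

    Ties are broken towards fewer MPEs, an assumed proxy for lower
    resource usage.
    """
    candidates = []
    for width in widths:
        n_mpes = min(max_mpes, lane_budget // width)
        if n_mpes >= 1:
            cycles = estimate_cycles(block_sizes, n_mpes, width)
            candidates.append((cycles, n_mpes, width))
    return min(candidates)


# Hypothetical meshes: many small blocks vs a few large blocks.
small = explore([6] * 48, lane_budget=32, max_mpes=8)
large = explore([64] * 8, lane_budget=32, max_mpes=8)
```

Even in this toy model the preferred split between number of MPEs and MPE width depends on the block sizes, which is the tradeoff the real model captures in more detail.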
slide-23
SLIDE 23

Runtime

  • Software layer - can be integrated in existing FEM software packages
  • Reorder to enforce linear access pattern in DRAM

○ Maximise throughput
○ Minimise control logic
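A minimal sketch of what such a reordering could look like on the host side follows. The round-robin assignment of blocks to MPEs and the row-major layout within a block are assumptions for illustration, and `pack_streams` is a hypothetical helper. The idea is that each MPE's data ends up as one contiguous stream that can be read from its memory channel in order.

```python
def pack_streams(blocks, n_mpes):
    """Pack dense blocks into one flat, sequentially readable stream per MPE.

    blocks: list of dense blocks, each a list of rows.
    Returns a list of n_mpes flat lists of matrix values.
    """
    streams = [[] for _ in range(n_mpes)]
    for i, block in enumerate(blocks):
        stream = streams[i % n_mpes]  # assumed round-robin assignment
        for row in block:
            stream.extend(row)  # row-major order within the block
    return streams


# Hypothetical input: three blocks distributed over two MPEs.
blocks = [
    [[1, 2], [3, 4]],
    [[5]],
    [[6, 7], [8, 9]],
]
streams = pack_streams(blocks, 2)
```

With the data laid out this way, no address generation beyond a running counter is needed per stream.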

slide-24
SLIDE 24

Putting it Together


slide-25
SLIDE 25

Putting it Together

Offline tuning: build a repository of customised architectures from a set of mesh instances

slide-26
SLIDE 26

Putting it Together

Offline tuning: build a repository of customised architectures from a set of mesh instances

Runtime: select the optimal architecture for an input mesh instance

slide-27
SLIDE 27

Evaluation


slide-28
SLIDE 28

Evaluation

  • Implementation

○ Design: MaxCompiler + MaxJ dataflow language
○ FPGA Server: Maxeler Max 4 Maia (Stratix VSG, 48GB DRAM, per board)
○ Software: C++14, G++ 5.2
○ CPU Server: Dual Intel Xeon E5-2640, 64GB DRAM, Infiniband QSFP
○ Place and route with Altera Quartus 14.1
○ Available as extension to the CASK framework [Grigoras et al, FPGA 16]:
■ http://caskorg.github.io/cask/

  • Reference software - Nektar++ FEM Package, http://www.nektar.info/
  • Reference hardware

○ [Burovskiy et al, FPL 15], Nektar++ Accelerated FEM

slide-29
SLIDE 29

Experiments

1. What is the benefit of tuning architecture based on mesh properties?

a. Fixed mesh (NACA 1L, [Burovskiy et al, FPL 2015]) - optimal architecture


slide-30
SLIDE 30


slide-31
SLIDE 31

Compute efficiency is maximised for smaller MPE Width


slide-32
SLIDE 32

Compute efficiency is maximised for smaller MPE Width


Achieved DRAM bandwidth is maximised for larger MPE Width

slide-33
SLIDE 33

Compute efficiency is maximised for smaller MPE Width

Achieved DRAM bandwidth is maximised for larger MPE Width

⇒ aggressive tuning (max MPE Width) is not resource efficient
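
One plausible contributor to the compute efficiency trend is padding: if a block row of length b is processed in chunks of MPE Width lanes, the last chunk may be only partly used. The sketch below quantifies this under that assumption. The formula and the name `lane_utilisation` are illustrative and are not taken from the slides or the paper.

```python
from math import ceil


def lane_utilisation(block_size, width):
    """Fraction of lanes doing useful work on one block row.

    Assumes each row of length block_size is padded up to a multiple
    of width.
    """
    return block_size / (width * ceil(block_size / width))


# Utilisation for a hypothetical block size of 10 at several widths.
utilisation = {width: lane_utilisation(10, width) for width in (2, 8, 32)}
```

Under this assumption, wider MPEs waste more lanes on small blocks even though they can consume more DRAM bandwidth per cycle.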

slide-34
SLIDE 34

Experiments

1. What is the benefit of tuning architecture based on mesh properties?

a. Fixed mesh (NACA 1L, [Burovskiy et al, FPL 2015]) - optimal architecture

⇒ find architecture with good efficiency
⇒ improve performance subject to resource usage

slide-35
SLIDE 35

Experiments

1. What is the benefit of tuning architecture based on mesh properties?

a. Fixed mesh (NACA 1L, [Burovskiy et al, FPL 2015]) - optimal architecture

⇒ find architecture with good efficiency
⇒ improve performance subject to resource usage

b. Fixed architecture, variable mesh, data parallel vs task parallel

slide-36
SLIDE 36


slide-37
SLIDE 37

  • A1: ~2X better
  • A2: ~2X better

slide-38
SLIDE 38

  • A1: ~2X better; + memory channels, - vector lanes, 1058 BRAMs
  • A2: ~2X better; - memory channels, + vector lanes, 686 BRAMs

slide-39
SLIDE 39

  • A1: ~2X better; + memory channels, - vector lanes, 1058 BRAMs ⇒ good for small blocks
  • A2: ~2X better; - memory channels, + vector lanes, 686 BRAMs ⇒ good for large blocks

slide-40
SLIDE 40

  • A1: ~2X better; + memory channels, - vector lanes, 1058 BRAMs ⇒ good for small blocks
  • A2: ~2X better; - memory channels, + vector lanes, 686 BRAMs ⇒ good for large blocks

slide-41
SLIDE 41

Experiments

1. What is the benefit of tuning architecture based on mesh properties?

a. Fixed mesh (NACA 1L, [Burovskiy et al, FPL 2015]) - optimal architecture

⇒ find architecture with good efficiency
⇒ improve performance subject to resource usage

b. Fixed architecture, variable mesh, data parallel vs task parallel

⇒ select optimal MPE Width and N_MPEs for given mesh
⇒ improve performance, reduce resource usage

slide-42
SLIDE 42

Experiments

1. What is the benefit of tuning architecture based on mesh properties?

a. Fixed mesh (NACA 1L, [Burovskiy et al, FPL 2015]) - optimal architecture

⇒ find architecture with good efficiency
⇒ improve performance subject to resource usage

b. Fixed architecture, variable mesh, data parallel vs task parallel

⇒ select optimal MPE Width and N_MPEs for given mesh
⇒ improve performance, reduce resource usage

2. What is the expected benefit for a full FEM implementation?

a. Baseline, Nektar++ implementation from [Burovskiy et al, FPL 2015]

slide-43
SLIDE 43


slide-44
SLIDE 44

Experiments

1. What is the benefit of tuning architecture based on mesh properties?

a. Fixed mesh (NACA 1L, [Burovskiy et al, FPL 2015]) - optimal architecture

⇒ find architecture with good efficiency
⇒ improve performance subject to resource usage

b. Fixed architecture, variable mesh, data parallel vs task parallel

⇒ select optimal MPE Width and N_MPEs for given mesh
⇒ improve performance, reduce resource usage

2. What is the expected benefit for a full FEM implementation?

a. Baseline, Nektar++ implementation from [Burovskiy et al, FPL 2015]

⇒ enable larger problem sizes, not supported by previous work
⇒ enable a good proportion of the projected speedup (3X over CPU)

slide-45
SLIDE 45

Conclusion

1. Proposed:

a. FPGA architecture optimised for variable-size block diagonal SpMV
b. Method to extract customisation parameters directly from mesh instance
c. Software to integrate with existing FEM package, Nektar++

2. Achieved:

a. Fit larger FEM problems on a single FPGA
b. 3X speedup over optimised CPU

3. Future: exploration of additional trade-offs & parameters

slide-46
SLIDE 46

That’s it folks! Thank you!
