SLIDE 1

Towards Scalable and Efficient FPGA Stencil Accelerators

Gaël Deest (1), Nicolas Estibals (1), Tomofumi Yuki (2), Steven Derrien (1), Sanjay Rajopadhye (3)

(1) IRISA / Université de Rennes 1 / Cairn
(2) INRIA / LIP / ENS Lyon
(3) Colorado State University

January 19th, 2016


SLIDE 2

Stencil Computations

Important class of algorithms:

• Iterative grid updates.
• Uniform dependences.

Examples:

• Solving partial differential equations
• Computer simulations (physics, seismology, etc.)
• (Real-time) image/video processing

Strong need for efficient hardware implementations.


SLIDE 3

Application Domains

Two main application types with vastly different goals:

HPC

• “Be as fast as possible”
• No real-time constraints

Embedded Systems

• “Be fast enough”
• Real-time constraints

For now, we focus on FPGAs from the HPC perspective.


SLIDE 4

FPGAs as Stencil Accelerators?

(Figure: architecture comparison.)

• CPU: ≈ 10 cores, DDR, ≈ 10 GB/s
• GPU: ≈ 100 cores, GDDR, ≈ 100 GB/s
• FPGA: ≈ 1000 “cores”, DDR, ≈ 1 GB/s

Features:

• Large on-chip bandwidth
• Fine-grained pipelining
• Customizable datapath / arithmetic

Drawbacks:

• Small off-chip bandwidth
• Difficult to program
• Lower clock frequencies


SLIDE 5

Design Challenges

At least two problems:

• Increase throughput with parallelization. Examples:
  • Multiple PEs.
  • Pipelining.
• Decrease bandwidth occupation:
  • Use on-chip memory to maximize reuse.
  • Choose memory mapping carefully to enable burst accesses.


SLIDE 6

Stencils “Done Right” for FPGAs

Observation:

• Many different strategies exist:
  • Multiple-level tiling
  • Deep pipelining
  • Time skewing
  • ...
• No single work puts them all together.

Key features:

• Target one large, deeply pipelined PE...
• ...instead of many small PEs
• Manage throughput/bandwidth with two-level tiling


SLIDE 7

Multiple-Level Tiling

Composition of two or more tiling transformations to account for:

• Memory hierarchies and locality:
  • Registers, caches, RAM, disks, ...
• Multiple levels of parallelism:
  • Instruction-Level, Thread-Level, ...

In this work (see the loop-nest sketch below):

  1. Inner tiling level: parallelism.
  2. Outer tiling level: communication.
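To make the nesting concrete, here is a minimal plain-C sketch of the two levels composed on the Jacobi example introduced later. All tile sizes are hypothetical placeholders, and the skew needed to make this execution order legal is deliberately omitted; only the loop structure is the point.

#define T  64    /* time steps (hypothetical) */
#define N  256   /* grid size (hypothetical) */
#define CT 16    /* Communication-Level tile size along t */
#define CX 64    /* Communication-Level tile size along x */
#define DT 4     /* Datapath-Level tile size along t */
#define DX 4     /* Datapath-Level tile size along x */

float f[T][N];

/* Two-level tiled Jacobi nest: outer (CL) tiles govern off-chip traffic,
 * inner (DL) tiles map onto the pipelined macro-operator. */
void jacobi_tiled(void) {
    for (int tc = 1; tc < T; tc += CT)                              /* CL tiles */
        for (int xc = 1; xc < N - 1; xc += CX)
            for (int td = tc; td < tc + CT && td < T; td += DT)     /* DL tiles */
                for (int xd = xc; xd < xc + CX && xd < N - 1; xd += DX)
                    for (int t = td; t < td + DT && t < T; t++)     /* point loops */
                        for (int x = xd; x < xd + DX && x < N - 1; x++)
                            f[t][x] = (f[t-1][x-1] + f[t-1][x] + f[t-1][x+1]) / 3;
}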


SLIDE 8

Overview of Our Approach

Core ideas:

  1. Execute inner, Datapath-Level (DL) tiles on a single, pipelined “macro-operator”.
     • Fire a new tile execution each cycle.
     • Delegate operator pipelining to HLS.
  2. Group DL-tiles into Communication-Level (CL) tiles to decrease bandwidth requirements.
     • Store intermediary results on chip.

SLIDE 9

Outline

• Introduction
• Approach
• Evaluation
• Related Work and Comparison
• Future Work & Conclusion


SLIDE 10

Running Example: Jacobi (3-point, 1D-data)

Simplified code:

for (t = 1; t < T; t++)
  for (x = 1; x < N-1; x++)
    f[t][x] = (f[t-1][x-1] + f[t-1][x] + f[t-1][x+1]) / 3;

Dependence vectors: (−1, −1), (−1, 0), (−1, 1)


SLIDE 11

Datapath-Level Tiling


SLIDE 12

Datapath-Level Tiling

Skewing: (t, x) → (t, x + t)
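As a quick check of what this skew buys (our own arithmetic, using the dependence convention of slide 10), a dependence $(d_t, d_x)$ becomes $(d_t, d_x + d_t)$:

\[
(-1,-1) \mapsto (-1,-2), \qquad (-1,0) \mapsto (-1,-1), \qquad (-1,1) \mapsto (-1,0).
\]

All components are now non-positive, which is what makes rectangular Datapath-Level tiles legal.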



SLIDE 14

Datapath-Level Tile Operator

for (t = ...) {
  for (x = ...) {
#pragma HLS PIPELINE II=1
    for (tt = ...) {
#pragma HLS UNROLL
      for (xx = ...) {
#pragma HLS UNROLL
        int ti = t + tt, xi = x + xx - ti;  /* undo the (t, x) -> (t, x + t) skew */
        f[ti][xi] = (f[ti-1][xi-1] + f[ti-1][xi] + f[ti-1][xi+1]) / 3;
      }
    }
  }
}

Types of parallelism:

• Operation-level parallelism (exposed by unrolling).
• Temporal parallelism (through pipelined tile executions).


SLIDE 15

Pipelined Execution

Pipelined execution requires inter-tile parallelism.

(Figure: original dependences, tile-level dependences, Gauss-Seidel dependences.)


SLIDE 16

Wavefronts of Datapath-Level Tiles


SLIDE 17

Wavefronts of Datapath-Level Tiles

Skewing: (t, x) → (t + x, x)


SLIDE 18

Wavefronts of Datapath-Level Tiles

(Figure: the resulting wavefronts.)


SLIDE 19

Managing Compute/IO Ratio

Problem

Suppose direct pipelining of 2 × 2 DL-tiles. At each clock cycle:

• A new tile enters the pipeline.
• Six 32-bit values are fetched from off-chip memory.

At 100 MHz, bandwidth usage is 19.2 Gbit/s (2.4 GB/s)!
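The arithmetic behind that figure, spelled out:

\[
6 \times 32\,\mathrm{bit} \times 100\,\mathrm{MHz} = 19.2\,\mathrm{Gbit/s} = 2.4\,\mathrm{GB/s}.
\]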

Solution

Use a second tiling level to decrease bandwidth requirements.


SLIDE 20

Communication-Level Tiling

(Figure: DL-tiles 1–4 grouped into wavefronts WF1 and WF2.)

Shape constraints:

Size constraints:


SLIDE 21

Communication-Level Tiling

(Figure: wavefronts of constant height, d1 = d2.)

Shape constraints:

• Constant-height wavefronts
  • Enables use of simple FIFOs for intermediary results

Size constraints:


SLIDE 22

Communication-Level Tiling

(Figure: DL-tiles 0–6 filling a pipeline of depth d = 4; tiles per wavefront must be ≥ d.)

Shape constraints:

• Constant-height wavefronts
  • Enables use of simple FIFOs for intermediary results

Size constraints:

• Tiles per WF ≥ pipeline depth


SLIDE 23

Communication-Level Tiling

Shape constraints:

• Constant-height wavefronts
  • Enables use of simple FIFOs for intermediary results

Size constraints (checked mechanically in the sketch below):

• Tiles per WF ≥ pipeline depth
• BW requirements ≤ chip limit
• Size of FIFOs ≤ chip limit
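These size constraints are simple enough to test when exploring tile sizes. A minimal sketch of such a check (the function and its flat-threshold model are our own illustration, not the authors' tooling):

#include <stdbool.h>
#include <stddef.h>

/* Feasibility test for a CL-tile candidate. All parameters are assumed
 * to be derived elsewhere from the tile sizes, data width, and target
 * clock/FPGA; the names are hypothetical. */
bool cl_tile_feasible(int tiles_per_wavefront, int pipeline_depth,
                      double bandwidth_gbs, double bandwidth_limit_gbs,
                      size_t fifo_bytes, size_t bram_bytes) {
    return tiles_per_wavefront >= pipeline_depth    /* keep the pipeline full    */
        && bandwidth_gbs <= bandwidth_limit_gbs     /* off-chip BW within budget */
        && fifo_bytes <= bram_bytes;                /* FIFOs fit on chip         */
}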


SLIDE 24

Communication-Level Tile Shape

Hyperparallelepipedic (rectangular) tiles satisfy all shape constraints.

(Figure: tile shape mapped back to the original space through skew⁻¹.)


SLIDE 25

Communication

Two aspects:

On-chip Communication

• Between DL-tiles
• Uses FIFOs

Off-chip Communication

• Between CL-tiles
• Uses memory accesses


SLIDE 26

On-Chip Communication

We use Canonic Multi-Projections (Yuki and Rajopadhye, 2011). Main ideas:

• Communicate along canonical axes.
• Project diagonal dependences on canonical directions.
• Some values are redundantly stored.

(Figure: per-axis buffers buff_t (in/out) and buff_x (in/out) around the tile operator.)
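To make the FIFO plumbing concrete, a plain-C sketch of one DL-tile step (the buffer names mirror the figure labels; a real design would use HLS stream primitives, and the tile computation is stubbed out):

#define FIFO_DEPTH 64   /* hypothetical depth, sized from the wavefront */

typedef struct { float data[FIFO_DEPTH]; int head, tail; } fifo_t;

static void fifo_push(fifo_t *f, float v) { f->data[f->tail % FIFO_DEPTH] = v; f->tail++; }
static float fifo_pop(fifo_t *f) { float v = f->data[f->head % FIFO_DEPTH]; f->head++; return v; }

/* One DL-tile step: read boundary values coming from the neighbouring
 * tiles along each canonical axis, compute, and push the boundaries
 * this tile produces for its successors. */
void dl_tile_step(fifo_t *buff_t_in, fifo_t *buff_x_in,
                  fifo_t *buff_t_out, fifo_t *buff_x_out) {
    float from_t = fifo_pop(buff_t_in);   /* from the previous tile along t */
    float from_x = fifo_pop(buff_x_in);   /* from the previous tile along x */
    /* ... tile computation elided; stubbed as a pass-through ... */
    fifo_push(buff_t_out, from_x);
    fifo_push(buff_x_out, from_t);
}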


SLIDE 27

Off-Chip Communication

Between CL-tiles (assuming lexicographic ordering):

• Data can be reused along the innermost dimension.
• Data from/to other tiles must be fetched/stored off-chip.
• Complex shape.
• Key for performance: use burst accesses.
• Maximize contiguity with clever memory mapping (see the sketch below).
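As an illustration of the burst idea, a hedged Vivado-HLS-style sketch (our own code, not the authors': in Vivado HLS a memcpy from an m_axi pointer is the standard way to infer a burst; FACE_WORDS and the layout are hypothetical):

#include <string.h>
#include <stddef.h>

#define FACE_WORDS 256   /* hypothetical size of one CL-tile face */
typedef float data_t;

/* Fetch one CL-tile face from DDR. This only becomes a single long
 * AXI burst if the memory mapping laid the face out contiguously --
 * which is exactly what the mapping is chosen to ensure. */
void fetch_face(const data_t *ddr, size_t offset, data_t face[FACE_WORDS]) {
#pragma HLS INTERFACE m_axi port=ddr
    memcpy(face, ddr + offset, FACE_WORDS * sizeof(data_t));
}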



SLIDE 29

Outline

• Introduction
• Approach
• Evaluation
• Related Work and Comparison
• Future Work & Conclusion


SLIDE 30

Metrics

• Hardware-related metrics:
  • Macro-operator pipeline depth
  • Area (slices, BRAM & DSP)
• Performance-related metrics (at steady state):
  • Throughput
  • Required bandwidth

SLIDE 31

Preliminary Results: Parallelism scalability

DL-tile size   Steady-state throughput   Compute resource usage   Pipeline depth
2×2            3.4 GFlop/s               2%                       61
2×4            5.8 GFlop/s               5%                       61
4×2            5.8 GFlop/s               9%                       117
4×4            11.5 GFlop/s              8%                       117
8×8            28.2 GFlop/s              34%                      229
2×2×2          7.2 GFlop/s               13%                      100
3×3×3          20.3 GFlop/s              21%                      148
4×4×4          38.4 GFlop/s              44%                      196

Choose the DL-tile size to control:

• Computational throughput
• Computational resource usage
• Macro-operator latency and pipeline depth


SLIDE 32

Preliminary Results: Bandwidth Usage Control

CL-tile size (for a 4×4×4 DL-tile)   Steady-state bandwidth   BRAM usage
n×15×14                              2.2 GB/s                 6%
n×22×22                              1.4 GB/s                 6%
n×23×23                              1.4 GB/s                 12%
n×31×32                              1 GB/s                   12%
n×32×32                              1 GB/s                   18%
n×38×39                              0.8 GB/s                 18%
n×44×45                              0.7 GB/s                 24%
n×59×59                              0.5 GB/s                 42%

Enlarging CL-tiles:

• Does not change throughput
• Reduces bandwidth requirements
• Has a low impact on hardware resources


SLIDE 33

Outline

• Introduction
• Approach
• Evaluation
• Related Work and Comparison
• Future Work & Conclusion


SLIDE 34

Related Work

• Hardware implementations:
  • Many ad-hoc / naive architectures
  • Systolic architectures (LSGP)
  • PolyOpt/HLS (Pouchet et al., 2013)
    • Tiling to control compute/IO balance
  • Alias et al., 2012
    • Single, pipelined operator
    • Innermost loop body only
• Tiling method:
  • “Jagged Tiling” (Shrestha et al., 2015)

SLIDE 35

Outline

• Introduction
• Approach
• Evaluation
• Related Work and Comparison
• Future Work & Conclusion


SLIDE 36

Future Work

• Finalize implementation
• Beyond Jacobi
• Exploring other number representations:
  • Fixed-point
  • Block floating-point
  • Custom floating-point
• Hardware/software codesign
• ...


SLIDE 37

Conclusion

• Design template for FPGA stencil accelerators
• Two levels of control:
  • Throughput
  • Bandwidth requirements
• Maximize use of pipeline parallelism


SLIDE 38

Thank You

Questions?
