
SLIDE 1

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

E. Calore, S. F. Schifano, R. Tripiccione

Enrico Calore INFN Ferrara, Italy

7th Workshop on UnConventional High Performance Computing, August 26, 2014, Porto, Portugal

  • E. Calore (INFN of Ferrara)

Portability, performance and scalability UCHPC, August 26, 2014 1 / 24

SLIDE 2

Outline

1. LBM at a glance: the D2Q37 model

2. OpenCL

3. Implementation details

4. Results and conclusions

We address the issue of porting code across several computing architectures while preserving performance.

SLIDE 3

The D2Q37 Lattice Boltzmann Model

The Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods:

◮ it simulates a synthetic dynamics described by the discrete Boltzmann equation, instead of the Navier-Stokes equations

◮ a set of virtual particles, called populations, is arranged at the edges of a discrete and regular grid

◮ interacting by propagation and collision, the populations reproduce, after appropriate averaging, the dynamics of fluids

D2Q37 is a 2D model with 37 velocity components (populations):

◮ suitable to study the behaviour of compressible gases and fluids, optionally in the presence of combustion¹ effects

◮ correct treatment of the Navier-Stokes, heat-transport and perfect-gas (P = ρT) equations

¹ chemical reactions turning a cold mixture of reactants into a hot mixture of burnt products.

SLIDE 4

Computational Scheme of LBM

foreach time-step
    foreach lattice-point
        propagate();
    endfor
    foreach lattice-point
        collide();
    endfor
endfor

Embarrassing parallelism

All sites can be processed in parallel, applying propagate and collide in sequence.

Challenge

Design an efficient implementation able to exploit a large fraction of the available peak performance.

SLIDE 5

D2Q37: propagation scheme

Propagation performs accesses to neighbour cells at distance 1, 2 and 3, generating memory accesses with sparse addressing patterns.

SLIDE 6

D2Q37: boundary-conditions

After propagation, boundary conditions are enforced at the top and bottom edges of the lattice.

2D lattice with periodic boundaries along the X-direction; at the top and bottom, boundary conditions are enforced:

◮ to adjust some values at sites y = 0 . . . 2 and y = Ny − 3 . . . Ny − 1

◮ e.g. set the vertical velocity to zero

At the left and right edges we apply periodic boundary conditions.

SLIDE 7

D2Q37 collision

Collision is computed at each lattice cell after the computation of boundary conditions:

◮ computationally intensive: the D2Q37 model requires ≈ 7500 DP floating-point operations per site

◮ completely local: arithmetic operations require only the populations associated to the site

The propagate and collide kernels are kept separate: after propagate but before collide we may need to perform collective operations (e.g. the divergence of the velocity field) if we include combustion effects in the computation.

SLIDE 8

Open Computing Language (OpenCL)

Programming framework for heterogeneous architectures: CPU + accelerators.

Computing model:

◮ host code plus one or more kernels running on accelerators

◮ kernels are executed by a set of work-items, each processing one item of the data-set (data parallelism)

◮ work-items are grouped into work-groups, each executed by a compute unit and processing K work-items in parallel using vector instructions

◮ e.g.: on the Xeon Phi, work-groups are mapped onto (virtual) cores, each processing up to 8 double-precision floating-point data items

The memory model identifies a hierarchy of four spaces, which differ in size and access time: private, local, global and constant memory.

OpenCL aims to guarantee portability of both code and performance across several architectures.

SLIDE 9

OCL Saxpy kernel

C = s · A + B,  s ∈ R,  A, B, C ∈ Rⁿ

__kernel void saxpy(__global double *A,
                    __global double *B,
                    __global double *C,
                    const double s)
{
    // get global thread ID
    int id = get_global_id(0);
    C[id] = s * A[id] + B[id];
}

Each work-item executes the saxpy kernel, computing just one data item of the output array: first it computes its unique global identifier id, then uses it to address the id-th data item of arrays A, B and C.

SLIDE 10

Memory layout for LB : AoS vs SoA

// lattice stored as AoS:
typedef struct {
    double p1;    // population 1
    double p2;    // population 2
    ...
    double p37;   // population 37
} pop_t;
pop_t lattice2D[SIZEX * SIZEY];

AoS: corresponding populations of different sites are interleaved, causing strided memory-access and leading to coalescing issues.

// lattice stored as SoA:
typedef struct {
    double p1[SIZEX * SIZEY];    // population 1 array
    double p2[SIZEX * SIZEY];    // population 2 array
    ...
    double p37[SIZEX * SIZEY];   // population 37 array
} pop_t;
pop_t lattice2D;

SoA: corresponding populations of different sites are allocated at contiguous memory addresses, enabling coalescing of accesses, and making use of full memory bandwidth.

SLIDE 11

Grids Layout

One-dimensional array of NTHREADS threads, each thread processing one lattice site.

Example: physical lattice of 11 × 16 cells; the size of work-groups is 1 × 4.

Ly = α × Nwi, α ∈ N; (Ly × Lx)/Nwi = Nwg

SLIDE 12

Hardware used: Eurora prototype

Eurora (Eurotech and Cineca)

Hot-water cooling system.

Delivers 3,209 MFLOPS per Watt of sustained performance: 1st in the Green500 of June 2013.

Computing nodes: 64

Processor types: Intel Xeon E5-2658 @ 2.10 GHz, Intel Xeon E5-2687W @ 3.10 GHz

Accelerator types: MIC (Intel Xeon Phi 5120D), GPU (NVIDIA Tesla K20x)

SLIDE 13

OpenCL Benchmark of Propagate (Xeon-Phi)

Performance of propagate as a function of the number of work-items Nwi per work-group and the number of work-groups Nwg.

SLIDE 14

OpenCL Benchmark of Collide (Xeon-Phi)

Performance of collide as a function of the number of work-items Nwi per work-group and the number of work-groups Nwg.

SLIDE 15

2 x NVIDIA K20s GPU

[Bar chart] Run time per iteration (msec) of the Propagate, BC and Collide phases on 2 × GPU (NVIDIA K20s), comparing CUDA and OpenCL.

SLIDE 16

2 x Intel Xeon Phi MIC

[Bar chart] Run time per iteration (msec) of the Propagate, BC and Collide phases on 2 × MIC (Intel Xeon Phi), comparing C and OpenCL.

SLIDE 17

Propagate

[Bar chart] Run time per iteration (msec) of Propagate on a 1920 × 2048 lattice, on MIC, GPU, CPU2 and CPU3, comparing C, optimized C, CUDA and OpenCL.

SLIDE 18

Collide

[Bar chart] Run time per iteration (msec) of Collide on a 1920 × 2048 lattice, on MIC, GPU, CPU2 and CPU3, comparing C, optimized C, CUDA and OpenCL.

SLIDE 19

Scalability on Eurora Nodes

Weak regime lattice size: 256 × 8192 × No_devices. Strong regime lattice size: 1024 × 8192.

SLIDE 20

Limitations to strong scalability

SLIDE 21

Conclusions

1. We have presented an OpenCL implementation of a fluid-dynamics simulation based on Lattice Boltzmann methods.

2. Code portability: it has been successfully ported and run on several computing architectures, including CPU, GPU and MIC systems.

3. Performance portability: results are at the same level as codes written using more "native" programming frameworks, such as CUDA or C.

4. The good news: these results make OpenCL a good framework to develop code that is easily portable across several architectures while preserving performance.

5. The bad news: not all vendors are committed today to supporting this standard, because it is considered a low-level approach.

SLIDE 22

Simulation of the Rayleigh-Taylor (RT) Instability

Instability at the interface of two fluids of different densities, triggered by gravity: a cold, dense fluid over a less dense and warmer fluid triggers an instability that mixes the two fluid regions (until equilibrium is reached).

SLIDE 23

Acknowledgments

Luca Biferale, Mauro Sbragaglia, Patrizio Ripesi (University of Tor Vergata and INFN Roma, Italy)

Andrea Scagliarini (University of Barcelona, Spain)

Filippo Mantovani (BSC, Spain)

Enrico Calore, Sebastiano Fabio Schifano, Raffaele Tripiccione (University and INFN of Ferrara, Italy)

Federico Toschi (Eindhoven University of Technology, The Netherlands, and CNR-IAC, Roma, Italy)

This work has been performed in the framework of the INFN COKA and SUMA projects. We would like to thank the CINECA (Italy) and JSC (Germany) institutes for access to their systems.

SLIDE 24

Thanks for your attention
