3D Particle Methods Code Parallelization
Fabrice Schlegel
Introduction
Goal: Efficient parallelization and memory optimization of a CFD code used for the direct numerical simulation (DNS) of turbulent combustion.
Hardware: 1) Our lab cluster, Pharos, which consists of 60 Intel Xeon Harpertown nodes, each with dual quad-core CPUs (8 processors) running at 2.66 GHz. 2) Shaheen, a 16-rack IBM Blue Gene/P system owned by KAUST University. Four 850 MHz processors are integrated on each Blue Gene/P chip; a standard Blue Gene/P rack houses 4,096 processors.
Parallel library: MPI
Outline
Brief overview of Lagrangian vortex methods
Comparison of the Ring and Copy parallelization algorithms
– Numerical results in terms of speed and parallel efficiency
Previous code modifications to account for the new data structure
Applications of the ring algorithm to large transverse jet simulations
Vortex Element Methods
Vortex simulations
Vorticity ω instead of velocity u: more compact support.
Non-dimensional vorticity transport equation with buoyancy:

∂ω/∂t + u·∇ω = ω·∇u + (1/Re) ∇²ω + (Gr/Re²) ∇ρ × g
Efficient utilization of computational elements.
An element is described by a discrete node point χᵢ with weight Wᵢ:

dχᵢ/dt = u(χᵢ, t)
dWᵢ/dt = (Wᵢ · ∇) u(χᵢ, t)

ω(x, t) = Σᵢ₌₁ᴺ Wᵢ(t) f_σ(x − χᵢ(t))
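The element representation above (node points carrying weight vectors, with the vorticity field recovered as a weighted sum of core functions) can be sketched as follows. This is a minimal illustration, not the lab's code: the Gaussian core `f_sigma` and all function names are assumptions.

```python
import numpy as np

def core_function(r, sigma):
    """Illustrative radially symmetric core (blob) function f_sigma.
    A normalized 3D Gaussian is assumed here."""
    return np.exp(-r**2 / (2.0 * sigma**2)) / ((2.0 * np.pi)**1.5 * sigma**3)

def vorticity_field(x, chi, W, sigma):
    """omega(x, t) = sum_i W_i * f_sigma(|x - chi_i|)."""
    r = np.linalg.norm(chi - x, axis=1)            # distance to each element
    return (W * core_function(r, sigma)[:, None]).sum(axis=0)

# Two elements with weight vectors W_i; evaluate omega at the origin.
chi = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]])
W = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
omega = vorticity_field(np.zeros(3), chi, W, sigma=0.5)
```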
Serial treecode [Lindsay and Krasny 2001]
– Velocity from the Rosenhead-Moore kernel.
– Constructs an adaptive oct-tree and uses a Taylor expansion of the kernel in Cartesian coordinates:
Fast summation for the velocity evaluation:

u(xⱼ) = Σᵢ₌₁ᴺ K_RM(xⱼ, yᵢ) × Wᵢ

K_RM(x, y) = −(x − y) / (4π (|x − y|² + δ²)^(3/2))

Taylor expansion about the cell center y_c:

u(x_target) ≈ Σ_p (1/p!) D_y^p K(x_target, y_c) · Σ_{yᵢ ∈ cell c} Wᵢ (yᵢ − y_c)^p

Taylor coefficients (1/p!) D_y^p K: computed with a recurrence relation.
Cell moments Σ Wᵢ (yᵢ − y_c)^p: stored for re-use.
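For reference, the direct O(N) summation that the treecode accelerates can be sketched as below. The sign convention and the placement of the smoothing parameter δ are assumptions made for illustration; they should be checked against the actual kernel definition.

```python
import numpy as np

def velocity_direct(x, y, W, delta):
    """Direct summation of the regularized Biot-Savart (Rosenhead-Moore)
    kernel over all N elements; sign convention assumed."""
    d = x - y                                          # (N, 3) separations
    denom = (np.sum(d * d, axis=1) + delta**2) ** 1.5  # (|x-y|^2 + d^2)^(3/2)
    return -np.sum(np.cross(d, W) / (4.0 * np.pi * denom[:, None]), axis=0)

rng = np.random.default_rng(0)
y = rng.standard_normal((1000, 3))    # element positions
W = rng.standard_normal((1000, 3))    # element weight vectors
u = velocity_direct(np.zeros(3), y, W, delta=0.1)
```

The treecode replaces the inner loop over far-away elements with the cell-moment expansion, cutting the cost from O(N²) to roughly O(N log N).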
Clustering of particles for optimal load balancing.
[Figure: distribution of 2,000,000 particles over 256 processors/clusters.]
A k-means clustering algorithm is used.
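A minimal sketch of the k-means (Lloyd's) iteration used for such clustering; the data, cluster count, and function names are illustrative, not taken from the actual code.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means: assign each particle to the nearest cluster
    center, then move each center to the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # membership: index of the nearest center for every particle
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        member = d.argmin(axis=1)
        for c in range(k):
            if np.any(member == c):
                centers[c] = points[member == c].mean(axis=0)
    return member, centers

pts = np.random.default_rng(1).standard_normal((2000, 3))
member, centers = kmeans(pts, k=8)
```

Each resulting cluster can then be assigned to one processor, so that spatially nearby particles live on the same rank.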
Copy vs. Ring Algorithm
Copy algorithm: each processor keeps a copy of all the particles involved; the processors then communicate their results.
Ring algorithm: each processor keeps track of only its own particles and communicates the sources to the others whenever needed.
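The ring pattern can be sketched without MPI: each rank keeps its target particles and the source blocks travel around the ring, one shift per step, until every rank has seen every block. All names here are illustrative, and `interact` stands in for the actual N-body kernel.

```python
def ring_allpairs(blocks, interact):
    """Simulated ring pass: rank p holds blocks[p] as targets permanently;
    the source blocks circulate so each rank sees every block once."""
    P = len(blocks)
    acc = [0.0] * P
    travel = list(blocks)                  # traveling source blocks
    for step in range(P):
        for p in range(P):
            acc[p] += interact(blocks[p], travel[p])
        # shift every traveling block to the next rank
        # (one MPI send/recv pair per rank in the real implementation)
        travel = [travel[(p - 1) % P] for p in range(P)]
    return acc

# Toy interaction: sum of pairwise products, so each rank's total equals
# sum(its targets) * sum(all sources).
blocks = [[1.0, 2.0], [3.0], [4.0, 5.0]]
interact = lambda tgt, src: sum(t * s for t in tgt for s in src)
totals = ring_allpairs(blocks, interact)   # [45.0, 45.0, 135.0]
```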
Parallel Performance (Watson-Shaheen)
[Figure: CPU time (s) for advection vs. number of processors (64, 256, 1024), 2,000,000 particles; strong scaling, ring vs. copy.]
[Figure: parallel efficiency vs. number of processors (64, 256, 1024), 2,000,000 particles; normalized strong scaling, ring vs. copy.]
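For clarity, the normalized strong-scaling efficiency plotted above is E(p) = (p₀·T(p₀)) / (p·T(p)), taking the smallest run as the baseline. A small sketch, with hypothetical timings for illustration only (not the measured Shaheen numbers):

```python
def parallel_efficiency(timings, base):
    """Strong-scaling efficiency relative to a baseline run:
    E(p) = (p0 * T(p0)) / (p * T(p))."""
    p0, t0 = base
    return {p: (p0 * t0) / (p * t) for p, t in timings.items()}

# Hypothetical timings (seconds), chosen only to show the computation.
timings = {64: 100.0, 256: 28.0, 1024: 9.0}
eff = parallel_efficiency(timings, base=(64, 100.0))
```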
Comparison
Copy algorithm:
- Less communication: recommended for high numbers of processors.
- Memory limited.

Ring algorithm:
- Allows for a bigger number of computational points.
- Too much data circulation: its efficiency decreases for high numbers of processors.
Parallel implementation using the ring algorithm
Resolution of the N-Body problem
- Parallel implementation using the ring algorithm (pure MPI).
- Performed a simulation with 60 million particles on 1 node of Pharos, i.e., 8 processors and 16 GB of memory.
- Expected results on Shaheen: 1.8 million particles per processor, i.e., 1.8 billion particles on 1024 processors.
New implementation of the clustering algorithm
- Assign a new membership to each particle.
- Heap-sort the particles inside each processor according to their membership.
- Redistribute all the particles to their respective processors using MPI, for better locality.
[Figure: particle redistribution across processors P1–P4.]
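The membership sort and redistribution steps can be sketched as follows. This is pure Python for illustration: the names `redistribute` and `cluster_owner` are assumptions, and the actual MPI exchange (e.g. an all-to-all such as MPI_Alltoallv) is omitted.

```python
from collections import defaultdict
import heapq

def redistribute(particles, membership, cluster_owner):
    """Heap-sort local particles by cluster membership, then bucket
    them by destination processor for the MPI exchange."""
    # Heap sort: push (membership, tiebreak, particle), pop in order.
    heap = [(m, i, p) for i, (m, p) in enumerate(zip(membership, particles))]
    heapq.heapify(heap)
    outbox = defaultdict(list)             # destination rank -> particles
    while heap:
        m, _, p = heapq.heappop(heap)
        outbox[cluster_owner[m]].append(p)
    return outbox

particles = ["a", "b", "c", "d"]
membership = [2, 0, 2, 1]                  # cluster id per particle
cluster_owner = {0: 0, 1: 0, 2: 1}         # cluster -> owning rank
outbox = redistribute(particles, membership, cluster_owner)
```

After the exchange, each processor holds a contiguous, membership-sorted block of its own particles, which is what gives the improved locality.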
Transverse jet: Motivations
Wide range of applications:
- Combustion: industrial burners, aircraft engines.
- Pipe tee mixers.
- V/STOL aerodynamics.
- Pollutant dispersion (chimney plumes, effluent discharge).
Photographs: Otto Perry (1894–1970), Western History Department of the Denver Public Library; Boeing Military Airplane Company; U.S. Department of Defense.
[Figure: gas-turbine combustor schematic — fuel nozzle, primary air jets, dilution air jets, primary and secondary combustion zones, turbine inlet guide vanes.]
Mixing in combustors for gas turbines: higher thrust; lower NOx, CO, UHC; better operability.
r = 10, Re = 245, Re_j = 2450
Numerical Results
Vorticity isosurfaces, |ω| = 3.5. The ring algorithm allows for large simulations (> 5 million points) that could not be run with the copy algorithm.
Future Tasks
Towards more parallel efficiency…
- Find hybrid strategies between the copy and the ring algorithms, by splitting the particles in such a way as to maximize the use of local memory, rather than splitting them by the number of processors when the memory limit has not been reached yet. This will reduce the number of shifts in the ring algorithm and increase its efficiency for large numbers of processors.
- Another alternative: use mixed OpenMP/MPI programming (see next slide). This will reduce the number of shifts by the number of processors per node (8 in our case).
Mixed MPI-OpenMP implementation
Another alternative would be to use MPI between nodes (distributed memory) and OpenMP locally on each node (shared memory), together with the ring algorithm. Pros:
- Easy implementation, few modifications.
- Built-in load-balancing subroutines.
- Fast summation will be more time-efficient on a bigger set of particles.
- Will reduce the communication time: the number of traveling clusters in the ring algorithm will be reduced by the number of processors per node (8 in our case).
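The claimed reduction in communication follows from simple counting: a full ring pass over R MPI ranks needs R − 1 shifts, so running one rank per node instead of one per core divides the rank count, and hence the shift count, by the processors per node. A tiny sketch of that arithmetic:

```python
def ring_shifts(n_ranks):
    """A full ring pass over n_ranks MPI ranks needs n_ranks - 1 shifts."""
    return n_ranks - 1

procs, per_node = 1024, 8
pure_mpi = ring_shifts(procs)              # one MPI rank per core: 1023 shifts
hybrid = ring_shifts(procs // per_node)    # one rank per node, OpenMP inside: 127
```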