3D Particle Methods Code Parallelization
Fabrice Schlegel
Introduction
Goal: Efficient parallelization and memory optimization of a CFD code used for the direct numerical simulation (DNS) of turbulent combustion.
Hardware: 1) Our lab cluster, Pharos, which consists of 60 Intel Xeon Harpertown nodes, each with dual quad-core CPUs (8 processors) running at 2.66 GHz. 2) Shaheen, a 16-rack IBM Blue Gene/P system owned by KAUST University. Four 850 MHz processors are integrated on each Blue Gene/P chip; a standard Blue Gene/P rack houses 4,096 processors.
Parallel library: MPI
Outline
Brief overview of Lagrangian vortex methods
Comparison of the Ring and Copy parallelization algorithms
– Numerical results in terms of speed and parallel efficiency
Previous code modifications to account for the new data structure
Applications of the ring algorithm to large transverse jet simulations
Vortex Element Methods
Vortex simulations
Vorticity ω instead of velocity u: more compact support.
Non-dimensional vorticity transport equation with buoyancy:

∂ω/∂t + u·∇ω = ω·∇u + (1/Re) ∇²ω + (Gr/Re²) ∇ρ × g
Efficient utilization of computational elements.
An element is described by a discrete node point χᵢ with weight Wᵢ:

dχᵢ/dt = u(χᵢ, t)
dWᵢ/dt = (Wᵢ · ∇) u(χᵢ, t)

ω(x, t) = Σᵢ₌₁ᴺ Wᵢ(t) f_σ(x − χᵢ(t))
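The element representation above (node points carrying weight vectors, with the vorticity field recovered as a weighted sum of core functions) can be sketched as follows. This is a minimal illustration, not the lab's code: the Gaussian core `f_sigma` and all function names are assumptions.

```python
import numpy as np

def core_function(r, sigma):
    """Illustrative radially symmetric core (blob) function f_sigma.
    A normalized 3D Gaussian is assumed here."""
    return np.exp(-r**2 / (2.0 * sigma**2)) / ((2.0 * np.pi)**1.5 * sigma**3)

def vorticity_field(x, chi, W, sigma):
    """omega(x, t) = sum_i W_i * f_sigma(|x - chi_i|)."""
    r = np.linalg.norm(chi - x, axis=1)            # distance to each element
    return (W * core_function(r, sigma)[:, None]).sum(axis=0)

# Two elements with weight vectors W_i; evaluate omega at the origin.
chi = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]])
W = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
omega = vorticity_field(np.zeros(3), chi, W, sigma=0.5)
```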
Serial treecode [Lindsay and Krasny 2001]
– Velocity from the Rosenhead-Moore kernel.
– Constructs an adaptive oct-tree and uses a Taylor expansion of the kernel in Cartesian coordinates:
Fast summation for the velocity evaluation:

u(xⱼ) = Σᵢ₌₁ᴺ K_RM(xⱼ, yᵢ) × Wᵢ

K_RM(x, y) = −(x − y) / (4π (|x − y|² + δ²)^(3/2))

Taylor expansion about the cell center y_c:

u(x_target) ≈ Σ_p (1/p!) D_y^p K(x_target, y_c) · Σ_{yᵢ ∈ cell c} Wᵢ (yᵢ − y_c)^p

Taylor coefficients (1/p!) D_y^p K: computed with a recurrence relation.
Cell moments Σ Wᵢ (yᵢ − y_c)^p: stored for re-use.
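For reference, the direct O(N) summation that the treecode accelerates can be sketched as below. The sign convention and the placement of the smoothing parameter δ are assumptions made for illustration; they should be checked against the actual kernel definition.

```python
import numpy as np

def velocity_direct(x, y, W, delta):
    """Direct summation of the regularized Biot-Savart (Rosenhead-Moore)
    kernel over all N elements; sign convention assumed."""
    d = x - y                                          # (N, 3) separations
    denom = (np.sum(d * d, axis=1) + delta**2) ** 1.5  # (|x-y|^2 + d^2)^(3/2)
    return -np.sum(np.cross(d, W) / (4.0 * np.pi * denom[:, None]), axis=0)

rng = np.random.default_rng(0)
y = rng.standard_normal((1000, 3))    # element positions
W = rng.standard_normal((1000, 3))    # element weight vectors
u = velocity_direct(np.zeros(3), y, W, delta=0.1)
```

The treecode replaces the inner loop over far-away elements with the cell-moment expansion, cutting the cost from O(N²) to roughly O(N log N).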
Clustering of particles for optimal load balancing.
[Figure: distribution of 2,000,000 particles over 256 processors/clusters.]
A k-means clustering algorithm is used.
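A minimal sketch of the k-means (Lloyd's) iteration used for such clustering; the data, cluster count, and function names are illustrative, not taken from the actual code.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means: assign each particle to the nearest cluster
    center, then move each center to the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # membership: index of the nearest center for every particle
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        member = d.argmin(axis=1)
        for c in range(k):
            if np.any(member == c):
                centers[c] = points[member == c].mean(axis=0)
    return member, centers

pts = np.random.default_rng(1).standard_normal((2000, 3))
member, centers = kmeans(pts, k=8)
```

Each resulting cluster can then be assigned to one processor, so that spatially nearby particles live on the same rank.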
Copy vs. Ring Algorithm
Copy algorithm: each processor keeps a copy of all the particles involved; the processors then communicate their results.
Ring algorithm: each processor keeps track of only its own particles and communicates the sources to the others whenever needed.
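The ring pattern can be sketched without MPI: each rank keeps its target particles and the source blocks travel around the ring, one shift per step, until every rank has seen every block. All names here are illustrative, and `interact` stands in for the actual N-body kernel.

```python
def ring_allpairs(blocks, interact):
    """Simulated ring pass: rank p holds blocks[p] as targets permanently;
    the source blocks circulate so each rank sees every block once."""
    P = len(blocks)
    acc = [0.0] * P
    travel = list(blocks)                  # traveling source blocks
    for step in range(P):
        for p in range(P):
            acc[p] += interact(blocks[p], travel[p])
        # shift every traveling block to the next rank
        # (one MPI send/recv pair per rank in the real implementation)
        travel = [travel[(p - 1) % P] for p in range(P)]
    return acc

# Toy interaction: sum of pairwise products, so each rank's total equals
# sum(its targets) * sum(all sources).
blocks = [[1.0, 2.0], [3.0], [4.0, 5.0]]
interact = lambda tgt, src: sum(t * s for t in tgt for s in src)
totals = ring_allpairs(blocks, interact)   # [45.0, 45.0, 135.0]
```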
Parallel Performance (Watson-Shaheen)
[Figure: CPU time (s) for advection vs. number of processors (64, 256, 1024), 2,000,000 particles; strong scaling, ring vs. copy.]
[Figure: parallel efficiency vs. number of processors (64, 256, 1024), 2,000,000 particles; normalized strong scaling, ring vs. copy.]
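For clarity, the normalized strong-scaling efficiency plotted above is E(p) = (p₀·T(p₀)) / (p·T(p)), taking the smallest run as the baseline. A small sketch, with hypothetical timings for illustration only (not the measured Shaheen numbers):

```python
def parallel_efficiency(timings, base):
    """Strong-scaling efficiency relative to a baseline run:
    E(p) = (p0 * T(p0)) / (p * T(p))."""
    p0, t0 = base
    return {p: (p0 * t0) / (p * t) for p, t in timings.items()}

# Hypothetical timings (seconds), chosen only to show the computation.
timings = {64: 100.0, 256: 28.0, 1024: 9.0}
eff = parallel_efficiency(timings, base=(64, 100.0))
```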
Comparison
Copy algorithm:
- Less communication: recommended for high numbers of processors.
- Memory limited.

Ring algorithm:
- Allows for a bigger number of computational points.
- Too much data circulation: its efficiency decreases for high numbers of processors.
Parallel implementation using the ring algorithm
Resolution of the N-Body problem
- Parallel implementation using the ring algorithm (pure MPI).
- Performed a simulation with 60 million particles on 1 node of Pharos, i.e., 8 processors and 16 GB of memory.
- Expected results on Shaheen: 1.8 million particles per processor, i.e., 1.8 billion particles on 1024 processors.
New implementation of the clustering algorithm
- Assign a new membership to each particle.
- Heap-sort the particles inside each processor according to their membership.
- Redistribute all the particles to their respective processors using MPI, for better locality.
[Figure: particle redistribution across processors P1–P4.]
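The membership sort and redistribution steps can be sketched as follows. This is pure Python for illustration: the names `redistribute` and `cluster_owner` are assumptions, and the actual MPI exchange (e.g. an all-to-all such as MPI_Alltoallv) is omitted.

```python
from collections import defaultdict
import heapq

def redistribute(particles, membership, cluster_owner):
    """Heap-sort local particles by cluster membership, then bucket
    them by destination processor for the MPI exchange."""
    # Heap sort: push (membership, tiebreak, particle), pop in order.
    heap = [(m, i, p) for i, (m, p) in enumerate(zip(membership, particles))]
    heapq.heapify(heap)
    outbox = defaultdict(list)             # destination rank -> particles
    while heap:
        m, _, p = heapq.heappop(heap)
        outbox[cluster_owner[m]].append(p)
    return outbox

particles = ["a", "b", "c", "d"]
membership = [2, 0, 2, 1]                  # cluster id per particle
cluster_owner = {0: 0, 1: 0, 2: 1}         # cluster -> owning rank
outbox = redistribute(particles, membership, cluster_owner)
```

After the exchange, each processor holds a contiguous, membership-sorted block of its own particles, which is what gives the improved locality.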
Transverse jet: Motivations
Wide range of applications:
- Combustion: industrial burners, aircraft engines.
- Pipe tee mixers.
- V/STOL aerodynamics.
- Pollutant dispersion (chimney plumes, effluent discharge).
Photographs: Otto Perry (1894–1970), Western History Department of the Denver Public Library; Boeing Military Airplane Company; U.S. Department of Defense.
[Figure: gas-turbine combustor schematic — fuel nozzle, primary air jets, dilution air jets, primary and secondary combustion zones, turbine inlet guide vanes.]
Mixing in combustors for gas turbines: higher thrust; lower NOx, CO, UHC; better operability.
r = 10, Re = 245, Re_j = 2450
Numerical Results
Vorticity isosurfaces, |ω| = 3.5. The ring algorithm allows for large simulations (> 5 million points) that could not be run with the copy algorithm.
Future Tasks
Towards more parallel efficiency…
- Find hybrid strategies between the copy and the ring algorithms, by splitting the particles in such a way as to maximize the use of local memory, rather than splitting them by the number of processors when the memory limit has not been reached yet. This will reduce the number of shifts in the ring algorithm and increase its efficiency for large numbers of processors.
- Another alternative: use mixed OpenMP/MPI programming (see next slide). This will reduce the number of shifts by the number of processors per node (8 in our case).
Mixed MPI-OpenMP implementation
Another alternative would be to use MPI between nodes (distributed memory) and OpenMP locally on each node (shared memory), together with the ring algorithm. Pros:
- Easy implementation, few modifications.
- Built-in load-balancing subroutines.
- Fast summation will be more time-efficient on a bigger set of particles.
- Will reduce the communication time: the number of traveling clusters in the ring algorithm will be reduced by the number of processors per node (8 in our case).
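The claimed reduction in communication follows from simple counting: a full ring pass over R MPI ranks needs R − 1 shifts, so running one rank per node instead of one per core divides the rank count, and hence the shift count, by the processors per node. A tiny sketch of that arithmetic:

```python
def ring_shifts(n_ranks):
    """A full ring pass over n_ranks MPI ranks needs n_ranks - 1 shifts."""
    return n_ranks - 1

procs, per_node = 1024, 8
pure_mpi = ring_shifts(procs)              # one MPI rank per core: 1023 shifts
hybrid = ring_shifts(procs // per_node)    # one rank per node, OpenMP inside: 127
```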