SLIDE 1

Asynchronous Distributed-Memory Task-Parallel Algorithm for Compressible Flows on 3D Unstructured Grids

  • J. Bakosi, M. Charest, A. Pandare, J. Waltz

Los Alamos National Laboratory, Los Alamos, NM, USA
October 20, 2020
LA-UR-20-28309

SLIDE 2

Project goals

◮ Large-scale Computational Fluid Dynamics (CFD) capability
◮ Simulation use cases
  ◮ shocked flow over surrogate reentry bodies
  ◮ blast loading on vehicles or other complex structures
  ◮ weapons effects calculations in urban environments
◮ Distinguishing characteristics
  ◮ external flows over complex 3D geometries
  ◮ high-speed compressible flow
◮ Capability requirements compared to internal flow calculations
  ◮ complex domain must be explicitly meshed (rather than modeled)
  ◮ multiple orders of magnitude larger computational meshes
  ◮ larger demand for HPC: O(10^9) cells and O(10^4) CPUs must be routine calculations

SLIDE 3

Quinoa::Inciter: Built on Charm++

◮ Compressible hydro (single or multiple materials)
◮ Unstructured 3D (tetrahedra only) grids
◮ Continuous and discontinuous Galerkin finite elements
◮ Adaptive: mesh refinement (WIP), polynomial-degree refinement
◮ Native Charm++ code interoperating with MPI libs
◮ Overdecomposition (see the sketch below)
◮ Parallel I/O
◮ SMP, non-SMP
◮ Automatic load balancing
◮ Open source: quinoacomputing.org
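The asynchronous, task-parallel execution model comes from Charm++: the mesh is partitioned into many more chunks (chares) than CPUs (overdecomposition), each chunk is a migratable object, and work proceeds through asynchronous entry-method invocations and reductions rather than bulk-synchronous phases. A minimal sketch of this pattern follows; the module, the Main/Worker names, and the 4-chares-per-PE ratio are illustrative assumptions, not Quinoa's actual code.

```cpp
// Minimal sketch of the Charm++ overdecomposition pattern (illustrative names,
// not Quinoa's actual classes). The matching .ci interface file would declare:
//
//   mainmodule sketch {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg*);
//       entry [reductiontarget] void done(double);
//     };
//     array [1D] Worker {
//       entry Worker();
//       entry void compute();
//     };
//   }
//
#include "sketch.decl.h"   // generated by charmc from the .ci file above

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
 public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    const int numChunks = 4 * CkNumPes();            // overdecompose: ~4 chares per PE
    CProxy_Worker workers = CProxy_Worker::ckNew(numChunks);
    workers.compute();                               // asynchronous broadcast, returns immediately
  }
  void done(double sum) {                            // reduction target: all chares contributed
    CkPrintf("global sum = %g\n", sum);
    CkExit();
  }
};

class Worker : public CBase_Worker {
 public:
  Worker() {}
  Worker(CkMigrateMessage*) {}
  void compute() {
    // ... work on this chare's mesh chunk (assemble right-hand side, etc.) ...
    double local = static_cast<double>(thisIndex);   // placeholder local result
    contribute(sizeof(double), &local, CkReduction::sum_double,
               CkCallback(CkReductionTarget(Main, done), mainProxy));
  }
};

#include "sketch.def.h"
```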

SLIDE 4

Quinoa::Inciter: ALECG hydro scheme, numerical method

◮ Edge-based finite element (or node-centered finite volume) method
◮ Compressible single-material (Euler, ideal gas) flow

$$
\frac{\partial U}{\partial t} + \frac{\partial F_j}{\partial x_j} = 0, \qquad
U = \begin{pmatrix} \rho \\ \rho u_i \\ \rho E \end{pmatrix}, \qquad
F_j = \begin{pmatrix} \rho u_j \\ \rho u_i u_j + p\,\delta_{ij} \\ u_j(\rho E + p) \end{pmatrix}
$$

◮ Galerkin lumped-mass, locally conservative formulation

$$
\frac{\mathrm{d}U^v}{\mathrm{d}t} = -\frac{1}{V^v} \left[
\sum_{vw \in v} D^{vw}_j F^{vw}_j
+ \sum_{vw \in v} B^{vw}_j \left( F^v_j + F^w_j \right)
+ B^v_j F^v_j \right]
$$

$$
U(\vec{x}) = \sum_{v \in \Omega_h} N^v(\vec{x})\, U^v, \qquad
D^{vw}_j = \frac{1}{2} \sum_{\Omega_h \in vw} \int_{\Omega_h}
\left( N^v \frac{\partial N^w}{\partial x_j} - N^w \frac{\partial N^v}{\partial x_j} \right) \mathrm{d}\Omega
$$

$$
B^{vw}_j = \frac{1}{2} \sum_{\Gamma_h \in vw} \int_{\Gamma_h} N^v N^w n_j \,\mathrm{d}\Gamma, \qquad
B^v_j = \sum_{\Gamma_h \in v} \int_{\Gamma_h} N^v N^v n_j \,\mathrm{d}\Gamma
$$
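In code, the domain-edge term of this update becomes a single pass over the unique edges of the mesh: for each edge vw, an edge flux is contracted with the precomputed coefficients D^{vw}_j and the result is scattered to both endpoints (antisymmetry, D^{wv}_j = -D^{vw}_j). The sketch below illustrates that structure under simplifying assumptions; the names (State, edgeFlux, domainIntegral) and the placeholder flux are illustrative, not Quinoa's actual implementation.

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// One conserved-variable set per mesh point: rho, rho*u1, rho*u2, rho*u3, rho*E.
using State = std::array<double, 5>;

// Placeholder edge flux: averages the endpoint states and contracts with
// D^{vw}_j. The actual ALECG scheme evaluates physical fluxes F_j here.
static State edgeFlux(const State& Uv, const State& Uw,
                      const std::array<double, 3>& Dvw) {
  const double d = Dvw[0] + Dvw[1] + Dvw[2];
  State f{};
  for (std::size_t c = 0; c < 5; ++c) f[c] = 0.5 * (Uv[c] + Uw[c]) * d;
  return f;
}

// Domain-edge contribution to the nodal right-hand side: one loop over unique
// edges, scattering equal and opposite contributions to the two endpoints.
void domainIntegral(const std::vector<std::pair<std::size_t, std::size_t>>& edges,
                    const std::vector<std::array<double, 3>>& Dvw,  // D^{vw}_j per edge
                    const std::vector<State>& U,
                    std::vector<State>& R) {
  for (std::size_t e = 0; e < edges.size(); ++e) {
    const auto [v, w] = edges[e];
    const State f = edgeFlux(U[v], U[w], Dvw[e]);
    for (std::size_t c = 0; c < 5; ++c) {
      R[v][c] += f[c];   // contribution D^{vw}_j F^{vw}_j to point v
      R[w][c] -= f[c];   // equal and opposite contribution to point w
    }
  }
}
```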

SLIDE 5

Quinoa::Inciter: ALECG hydro scheme, References I

◮ [1, 2, 3]

[1] J. Waltz, N. Morgan, T.R. Canfield, M.R.J. Charest, L.D. Risinger, and J.G. Wohlbier. A three-dimensional finite element arbitrary Lagrangian-Eulerian method for shock hydrodynamics on unstructured grids. Computers & Fluids, 92:172–187, 2014.

[2] J. Waltz, T.R. Canfield, N.R. Morgan, L.D. Risinger, and J.G. Wohlbier. Verification of a three-dimensional unstructured finite element method using analytic and manufactured solutions. Computers & Fluids, 81:57–67, 2013.

[3] J. Waltz, T.R. Canfield, N.R. Morgan, L.D. Risinger, and J.G. Wohlbier. Manufactured solutions for the three-dimensional Euler equations with relevance to Inertial Confinement Fusion. J. Comput. Phys., 267:196–209, 2014.
SLIDE 6

Solution verification: Vortical flow

[Convergence plot: L2 error vs. mesh size h for ρ, ρu1, ρu2, ρu3, and ρE, with a 2nd-order reference slope.]

Figure: Left: initial (first column) and final (second column) velocity, pressure (third column), and total energy distributions (fourth column). Right: L2 errors as a function of mesh resolution.

SLIDE 7

Solution verification: Sedov

[Left: density profiles along x on Meshes 1–4 compared with the semi-analytic Sedov solution. Right: L1 error of ρ vs. mesh size h (measured slope = 0.9592) with a 1st-order reference.]

SLIDE 8

Solution validation: square cavity, domain and initial conditions

[Diagram: square cavity domain with two initial states, State 1 and State 2.]

Figure: Domain and initial conditions for the square cavity problem. Dimensions are in cm.

SLIDE 9

Solution validation: square cavity, solution with experimental data

Figure: Solutions with increasingly finer meshes for the square cavity problem. Lines S1, Sr1, and Sr2 denote experimental shock positions.

SLIDE 10

Solution validation: Onera M6 wing, mesh and numerical solution

Figure: Top – upper and lower surface mesh used for the ONERA M6 wing configuration. Bottom – computed pressure contours on the upper and lower surface.

SLIDE 11

Solution validation: Onera M6 wing, simulation & experiments

[Six panels: surface pressure coefficient −Cp vs. x/c at 20%, 44%, 65%, 80%, 90%, and 95% semispan; each panel compares experiment with computations on a coarse and a finer mesh.]

Figure: Comparison between the computed and experimental surface pressure coefficients for the ONERA M6 wing at 20%, 44%, 65%, 80%, 90%, and 95% semispan.

SLIDE 12

Quinoa::Inciter: ALECG, on-node performance

Time step profile:

  phase    µs           %
  rhs      8,482,724    91
  bgrad    34,333       0.4
  diag     48,549       0.5
  solve    40,355       0.4
  total    27,830,000   100

RHS profile:

  phase    µs           %
  grad     1,109,746    51
  domain   677,741      30
  bnd      2,565
  src      413,999      19
  total    2,183,459    100

SLIDE 13

Quinoa::Inciter: ALECG, on-node performance improvements

  • 1. Remove unnecessary code for generating unused derived data structures: 1.6x.
  • 2. Replace a tree-based data structure with a flat one, enabling streaming-style (contiguous) access to the normals associated with edges: 1.3x (see the sketch after this list).
  • 3. Rewrite the domain integral from a nested loop (over mesh points and over the edges connected to a point) as a single loop over unique edges: 1.3x.
  • 4. Optimize data access in the source term: 1.4x.
  • 5. Rewrite the loop computing primitive-variable gradients from a gather-scatter loop over elements to a nested loop over mesh points with an inner loop over the edges connected to a point: 1.5x.

Altogether: 6.2x speedup.
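A rough illustration of what items 2 and 3 amount to in code, with illustrative names: the edge normals move from a tree keyed by point pairs to flat, contiguous arrays indexed by a unique-edge id, and the domain integral becomes one streaming loop over those edges.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using EdgeKey = std::pair<std::size_t, std::size_t>;   // endpoints (v, w) of an edge

// Before (items 2-3): tree-based storage keyed by point pairs; the domain
// integral walks mesh points and the edges connected to each point, and every
// normal lookup chases pointers through the map.
using NormalTree = std::map<EdgeKey, std::array<double, 3>>;

// After: flat, contiguous arrays indexed by a unique-edge id, read in order
// by a single loop, i.e. streaming (cache- and prefetch-friendly) access.
struct EdgeData {
  std::vector<EdgeKey> edge;                  // unique edges
  std::vector<std::array<double, 3>> normal;  // normal per unique edge, contiguous
};

void domainIntegralFlat(const EdgeData& ed /*, states, right-hand side, ... */) {
  for (std::size_t e = 0; e < ed.edge.size(); ++e) {
    const auto [v, w] = ed.edge[e];
    const std::array<double, 3>& n = ed.normal[e];  // contiguous read
    // ... evaluate the edge flux with n and scatter it to points v and w ...
    (void)v; (void)w; (void)n;
  }
}
```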

SLIDE 14

Quinoa::Inciter: 3 hydro schemes, strong scaling

[Strong-scaling plot: wall-clock time (s) vs. number of CPUs (36/node), from 360 to 50,400 CPUs, for the CG, DG(P1), and ALECG schemes in SMP and non-SMP modes, with an ideal-scaling reference.]

Single-material hydro, 794M cells (100 time steps, no I/O)

SLIDE 15

Quinoa::Inciter: Parallel load imbalance triggered by physics

Figure: Spatial distributions of extra load in each cell whose fluid density exceeds the value of 1.5, during time evolution of the Sedov problem: (left) shortly after the onset of load imbalance, (right) at a later time of the simulation.
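The imbalance in the figure is tied to the solution itself: cells whose density exceeds the 1.5 threshold carry extra work, so the overloaded region moves with the expanding Sedov shock. A minimal, purely illustrative sketch of such a trigger (arbitrary dummy work, hypothetical names) is below.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: perform extra (artificial) work in every cell whose
// density exceeds a threshold, so the load follows the expanding shock.
double addExtraLoad(const std::vector<double>& cellDensity,
                    double threshold = 1.5,
                    std::size_t extraIters = 10000) {
  double sink = 0.0;                           // dummy accumulator
  for (double rho : cellDensity) {
    if (rho > threshold) {
      for (std::size_t i = 0; i < extraIters; ++i)
        sink = sink + std::sqrt(static_cast<double>(i) + rho);   // dummy FLOPs
    }
  }
  return sink;                                 // returned so the work is not optimized away
}
```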

SLIDE 16

Quinoa::Inciter: Automatic load balancing yields 10x speedup

[Plot: grind time (ms/time step) vs. time step for: no extra load with virtualization 0 and 100x (no LB); extra load with virtualization 0 (no LB); and extra load with virtualization 10x or 100x using GreedyCommLB, DistributedLB, and NeighborLB.]

Figure: Grind time during time stepping of a Sedov problem with load imbalance, using various built-in Charm++ load balancers. Run on 10 compute nodes with 36 CPUs/node.
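The load balancers above need only two things from the application: chares must be serializable (PUP) so they can migrate, and they must periodically yield at AtSync(); the strategy itself is picked at launch with the +balancer runtime flag. A rough sketch of those hooks, with illustrative names and assuming the matching .ci declarations as in the earlier Charm++ sketch, is below.

```cpp
#include <vector>

// Sketch of the Charm++ hooks for migration-based load balancing (illustrative
// chare name; assumes a .ci file declaring Worker with an advanceTimeStep()
// entry method, and the generated decl/def headers as in the earlier sketch).
class Worker : public CBase_Worker {
  std::vector<double> state;                  // per-chunk solution data
 public:
  Worker() { usesAtSync = true; }             // opt in to synchronized load balancing
  Worker(CkMigrateMessage*) {}

  void pup(PUP::er& p) {                      // serialize the chare so it can migrate
    CBase_Worker::pup(p);
    p | state;
  }

  void advanceTimeStep() {
    // ... compute one time step on this chunk ...
    AtSync();                                 // hand control to the load balancer here
  }
  void ResumeFromSync() {                     // called after (possible) migration
    thisProxy[thisIndex].advanceTimeStep();   // resume time stepping
  }
};

// The strategy is chosen at job launch (the executable must be linked with the
// balancer modules, e.g. -module CommonLBs), for example:
//   ./charmrun +p360 ./inciter ... +balancer GreedyCommLB
//   ./charmrun +p360 ./inciter ... +balancer DistributedLB
```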

SLIDE 17

Current and future work

  • 1. Multi-material FV/DG at large scales
  • 2. P-adaptation
  • 3. Productization (SBIR, PI: Charmworks)
  • 4. 3D mesh-to-mesh solution transfer toward large-scale fluid-structure interaction (see next talk by Eric Mikida)