SLIDE 1

Asynchronous Distributed-Memory Task-Parallel Algorithm for Compressible Flows on 3D Unstructured Grids

  • J. Bakosi, M. Charest, A. Pandare, J. Waltz

Los Alamos National Laboratory, Los Alamos, NM, USA
October 20, 2020
LA-UR-20-28309

SLIDE 2

Project goals

◮ Large-scale Computational Fluid Dynamics (CFD) capability
◮ Simulation use cases
  ◮ shocked flow over surrogate reentry bodies
  ◮ blast loading on vehicles or other complex structures
  ◮ weapons effects calculations in urban environments
◮ Distinguishing characteristics
  ◮ external flows over complex 3D geometries
  ◮ high-speed compressible flow
◮ Capability requirements compared to internal flow calculations
  ◮ complex domain must be explicitly meshed (rather than modeled)
  ◮ multiple orders of magnitude larger computational meshes
  ◮ larger demand for HPC: O(10^9) cells and O(10^4) CPUs must be routine calculations

SLIDE 3

Quinoa::Inciter: Built on Charm++

◮ Compressible hydro (single or multiple materials)
◮ Unstructured 3D (tetrahedra only) grids
◮ Continuous and discontinuous Galerkin finite elements
◮ Adaptive: mesh refinement (WIP), polynomial-degree refinement
◮ Native Charm++ code interoperating with MPI libs
◮ Overdecomposition (see the sketch below)
◮ Parallel I/O
◮ SMP, non-SMP
◮ Automatic load balancing
◮ Open source: quinoacomputing.org
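The asynchronous, task-parallel execution model comes from Charm++: the mesh is partitioned into many more chunks (chares) than CPUs (overdecomposition), each chunk is a migratable object, and work proceeds through asynchronous entry-method invocations and reductions rather than bulk-synchronous phases. A minimal sketch of this pattern follows; the module, the Main/Worker names, and the 4-chares-per-PE ratio are illustrative assumptions, not Quinoa's actual code.

```cpp
// Minimal sketch of the Charm++ overdecomposition pattern (illustrative names,
// not Quinoa's actual classes). The matching .ci interface file would declare:
//
//   mainmodule sketch {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg*);
//       entry [reductiontarget] void done(double);
//     };
//     array [1D] Worker {
//       entry Worker();
//       entry void compute();
//     };
//   }
//
#include "sketch.decl.h"   // generated by charmc from the .ci file above

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
 public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    const int numChunks = 4 * CkNumPes();            // overdecompose: ~4 chares per PE
    CProxy_Worker workers = CProxy_Worker::ckNew(numChunks);
    workers.compute();                               // asynchronous broadcast, returns immediately
  }
  void done(double sum) {                            // reduction target: all chares contributed
    CkPrintf("global sum = %g\n", sum);
    CkExit();
  }
};

class Worker : public CBase_Worker {
 public:
  Worker() {}
  Worker(CkMigrateMessage*) {}
  void compute() {
    // ... work on this chare's mesh chunk (assemble right-hand side, etc.) ...
    double local = static_cast<double>(thisIndex);   // placeholder local result
    contribute(sizeof(double), &local, CkReduction::sum_double,
               CkCallback(CkReductionTarget(Main, done), mainProxy));
  }
};

#include "sketch.def.h"
```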

SLIDE 4

Quinoa::Inciter: ALECG hydro scheme, numerical method

◮ Edge-based finite element (or node-centered finite volume) method
◮ Compressible single-material (Euler, ideal gas) flow

$$
\frac{\partial U}{\partial t} + \frac{\partial F_j}{\partial x_j} = 0, \qquad
U = \begin{pmatrix} \rho \\ \rho u_i \\ \rho E \end{pmatrix}, \qquad
F_j = \begin{pmatrix} \rho u_j \\ \rho u_i u_j + p\,\delta_{ij} \\ u_j(\rho E + p) \end{pmatrix}
$$

◮ Galerkin lumped-mass, locally conservative formulation

$$
\frac{\mathrm{d}U^v}{\mathrm{d}t} = -\frac{1}{V^v} \left[
\sum_{vw \in v} D^{vw}_j F^{vw}_j
+ \sum_{vw \in v} B^{vw}_j \left( F^v_j + F^w_j \right)
+ B^v_j F^v_j \right]
$$

$$
U(\vec{x}) = \sum_{v \in \Omega_h} N^v(\vec{x})\, U^v, \qquad
D^{vw}_j = \frac{1}{2} \sum_{\Omega_h \in vw} \int_{\Omega_h}
\left( N^v \frac{\partial N^w}{\partial x_j} - N^w \frac{\partial N^v}{\partial x_j} \right) \mathrm{d}\Omega
$$

$$
B^{vw}_j = \frac{1}{2} \sum_{\Gamma_h \in vw} \int_{\Gamma_h} N^v N^w n_j \,\mathrm{d}\Gamma, \qquad
B^v_j = \sum_{\Gamma_h \in v} \int_{\Gamma_h} N^v N^v n_j \,\mathrm{d}\Gamma
$$
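In code, the domain-edge term of this update becomes a single pass over the unique edges of the mesh: for each edge vw, an edge flux is contracted with the precomputed coefficients D^{vw}_j and the result is scattered to both endpoints (antisymmetry, D^{wv}_j = -D^{vw}_j). The sketch below illustrates that structure under simplifying assumptions; the names (State, edgeFlux, domainIntegral) and the placeholder flux are illustrative, not Quinoa's actual implementation.

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// One conserved-variable set per mesh point: rho, rho*u1, rho*u2, rho*u3, rho*E.
using State = std::array<double, 5>;

// Placeholder edge flux: averages the endpoint states and contracts with
// D^{vw}_j. The actual ALECG scheme evaluates physical fluxes F_j here.
static State edgeFlux(const State& Uv, const State& Uw,
                      const std::array<double, 3>& Dvw) {
  const double d = Dvw[0] + Dvw[1] + Dvw[2];
  State f{};
  for (std::size_t c = 0; c < 5; ++c) f[c] = 0.5 * (Uv[c] + Uw[c]) * d;
  return f;
}

// Domain-edge contribution to the nodal right-hand side: one loop over unique
// edges, scattering equal and opposite contributions to the two endpoints.
void domainIntegral(const std::vector<std::pair<std::size_t, std::size_t>>& edges,
                    const std::vector<std::array<double, 3>>& Dvw,  // D^{vw}_j per edge
                    const std::vector<State>& U,
                    std::vector<State>& R) {
  for (std::size_t e = 0; e < edges.size(); ++e) {
    const auto [v, w] = edges[e];
    const State f = edgeFlux(U[v], U[w], Dvw[e]);
    for (std::size_t c = 0; c < 5; ++c) {
      R[v][c] += f[c];   // contribution D^{vw}_j F^{vw}_j to point v
      R[w][c] -= f[c];   // equal and opposite contribution to point w
    }
  }
}
```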

SLIDE 5

Quinoa::Inciter: ALECG hydro scheme, References I

◮ [1, 2, 3]

[1] J. Waltz, N. Morgan, T.R. Canfield, M.R.J. Charest, L.D. Risinger, and J.G. Wohlbier. A three-dimensional finite element arbitrary Lagrangian-Eulerian method for shock hydrodynamics on unstructured grids. Computers & Fluids, 92:172–187, 2014.

[2] J. Waltz, T.R. Canfield, N.R. Morgan, L.D. Risinger, and J.G. Wohlbier. Verification of a three-dimensional unstructured finite element method using analytic and manufactured solutions. Computers & Fluids, 81:57–67, 2013.

[3] J. Waltz, T.R. Canfield, N.R. Morgan, L.D. Risinger, and J.G. Wohlbier. Manufactured solutions for the three-dimensional Euler equations with relevance to Inertial Confinement Fusion. J. Comput. Phys., 267:196–209, 2014.
SLIDE 6

Solution verification: Vortical flow

[Convergence plot: L2 error vs. mesh size h for ρ, ρu1, ρu2, ρu3, and ρE, with a 2nd-order reference slope.]

Figure: Left: initial (first column) and final (second column) velocity, pressure (third column), and total energy distributions (fourth column). Right: L2 errors as a function of mesh resolution.

SLIDE 7

Solution verification: Sedov

[Left: density profiles along x on Meshes 1–4 compared with the semi-analytic Sedov solution. Right: L1 error of ρ vs. mesh size h (measured slope = 0.9592) with a 1st-order reference.]

SLIDE 8

Solution validation: square cavity, domain and initial conditions

[Diagram: square cavity domain with two initial states, State 1 and State 2.]

Figure: Domain and initial conditions for the square cavity problem. Dimensions are in cm.

SLIDE 9

Solution validation: square cavity, solution with experimental data

Figure: Solutions with increasingly finer meshes for the square cavity problem. Lines S1, Sr1, and Sr2 denote experimental shock positions.

SLIDE 10

Solution validation: Onera M6 wing, mesh and numerical solution

Figure: Top – upper and lower surface mesh used for the ONERA M6 wing configuration. Bottom – computed pressure contours on the upper and lower surface.

SLIDE 11

Solution validation: Onera M6 wing, simulation & experiments

[Six panels: surface pressure coefficient −Cp vs. x/c at 20%, 44%, 65%, 80%, 90%, and 95% semispan; each panel compares experiment with computations on a coarse and a finer mesh.]

Figure: Comparison between the computed and experimental surface pressure coefficients for the ONERA M6 wing at 20%, 44%, 65%, 80%, 90%, and 95% semispan.

SLIDE 12

Quinoa::Inciter: ALECG, on-node performance

Time step profile:

  phase    µs           %
  rhs      8,482,724    91
  bgrad    34,333       0.4
  diag     48,549       0.5
  solve    40,355       0.4
  total    27,830,000   100

RHS profile:

  phase    µs           %
  grad     1,109,746    51
  domain   677,741      30
  bnd      2,565
  src      413,999      19
  total    2,183,459    100

SLIDE 13

Quinoa::Inciter: ALECG, on-node performance improvements

  • 1. Remove unnecessary code for generating unused derived data structures: 1.6x.
  • 2. Replace a tree-based data structure with a flat one, enabling streaming-style (contiguous) access to the normals associated with edges: 1.3x (see the sketch after this list).
  • 3. Rewrite the domain integral from a nested loop (over mesh points and over the edges connected to a point) as a single loop over unique edges: 1.3x.
  • 4. Optimize data access in the source term: 1.4x.
  • 5. Rewrite the loop computing primitive-variable gradients from a gather-scatter loop over elements to a nested loop over mesh points with an inner loop over the edges connected to a point: 1.5x.

Altogether: 6.2x speedup.
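A rough illustration of what items 2 and 3 amount to in code, with illustrative names: the edge normals move from a tree keyed by point pairs to flat, contiguous arrays indexed by a unique-edge id, and the domain integral becomes one streaming loop over those edges.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using EdgeKey = std::pair<std::size_t, std::size_t>;   // endpoints (v, w) of an edge

// Before (items 2-3): tree-based storage keyed by point pairs; the domain
// integral walks mesh points and the edges connected to each point, and every
// normal lookup chases pointers through the map.
using NormalTree = std::map<EdgeKey, std::array<double, 3>>;

// After: flat, contiguous arrays indexed by a unique-edge id, read in order
// by a single loop, i.e. streaming (cache- and prefetch-friendly) access.
struct EdgeData {
  std::vector<EdgeKey> edge;                  // unique edges
  std::vector<std::array<double, 3>> normal;  // normal per unique edge, contiguous
};

void domainIntegralFlat(const EdgeData& ed /*, states, right-hand side, ... */) {
  for (std::size_t e = 0; e < ed.edge.size(); ++e) {
    const auto [v, w] = ed.edge[e];
    const std::array<double, 3>& n = ed.normal[e];  // contiguous read
    // ... evaluate the edge flux with n and scatter it to points v and w ...
    (void)v; (void)w; (void)n;
  }
}
```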

SLIDE 14

Quinoa::Inciter: 3 hydro schemes, strong scaling

[Strong-scaling plot: wall-clock time (s) vs. number of CPUs (36/node), from 360 to 50,400 CPUs, for the CG, DG(P1), and ALECG schemes in SMP and non-SMP modes, with an ideal-scaling reference.]

Single-material hydro, 794M cells (100 time steps, no I/O)

SLIDE 15

Quinoa::Inciter: Parallel load imbalance triggered by physics

Figure: Spatial distributions of extra load in each cell whose fluid density exceeds the value of 1.5, during time evolution of the Sedov problem: (left) shortly after the onset of load imbalance, (right) at a later time of the simulation.
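The imbalance in the figure is tied to the solution itself: cells whose density exceeds the 1.5 threshold carry extra work, so the overloaded region moves with the expanding Sedov shock. A minimal, purely illustrative sketch of such a trigger (arbitrary dummy work, hypothetical names) is below.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: perform extra (artificial) work in every cell whose
// density exceeds a threshold, so the load follows the expanding shock.
double addExtraLoad(const std::vector<double>& cellDensity,
                    double threshold = 1.5,
                    std::size_t extraIters = 10000) {
  double sink = 0.0;                           // dummy accumulator
  for (double rho : cellDensity) {
    if (rho > threshold) {
      for (std::size_t i = 0; i < extraIters; ++i)
        sink = sink + std::sqrt(static_cast<double>(i) + rho);   // dummy FLOPs
    }
  }
  return sink;                                 // returned so the work is not optimized away
}
```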

SLIDE 16

Quinoa::Inciter: Automatic load balancing yields 10x speedup

[Plot: grind time (ms/time step) vs. time step for: no extra load with virtualization 0 and 100x (no LB); extra load with virtualization 0 (no LB); and extra load with virtualization 10x or 100x using GreedyCommLB, DistributedLB, and NeighborLB.]

Figure: Grind time during time stepping of a Sedov problem with load imbalance, using various built-in Charm++ load balancers. Run on 10 compute nodes with 36 CPUs/node.
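The load balancers above need only two things from the application: chares must be serializable (PUP) so they can migrate, and they must periodically yield at AtSync(); the strategy itself is picked at launch with the +balancer runtime flag. A rough sketch of those hooks, with illustrative names and assuming the matching .ci declarations as in the earlier Charm++ sketch, is below.

```cpp
#include <vector>

// Sketch of the Charm++ hooks for migration-based load balancing (illustrative
// chare name; assumes a .ci file declaring Worker with an advanceTimeStep()
// entry method, and the generated decl/def headers as in the earlier sketch).
class Worker : public CBase_Worker {
  std::vector<double> state;                  // per-chunk solution data
 public:
  Worker() { usesAtSync = true; }             // opt in to synchronized load balancing
  Worker(CkMigrateMessage*) {}

  void pup(PUP::er& p) {                      // serialize the chare so it can migrate
    CBase_Worker::pup(p);
    p | state;
  }

  void advanceTimeStep() {
    // ... compute one time step on this chunk ...
    AtSync();                                 // hand control to the load balancer here
  }
  void ResumeFromSync() {                     // called after (possible) migration
    thisProxy[thisIndex].advanceTimeStep();   // resume time stepping
  }
};

// The strategy is chosen at job launch (the executable must be linked with the
// balancer modules, e.g. -module CommonLBs), for example:
//   ./charmrun +p360 ./inciter ... +balancer GreedyCommLB
//   ./charmrun +p360 ./inciter ... +balancer DistributedLB
```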

SLIDE 17

Current and future work

  • 1. Multi-material FV/DG at large scales
  • 2. P-adaptation
  • 3. Productization (SBIR, PI: Charmworks)
  • 4. 3D mesh-to-mesh solution transfer toward large-scale fluid-structure interaction (see next talk by Eric Mikida)