SLIDE 1


ExaDG: High-order discontinuous Galerkin for the exa-scale

Guido Kanschat¹, Katharina Kormann², Martin Kronbichler³, Wolfgang A. Wall³

¹ Interdisziplinäres Institut für Wissenschaftliches Rechnen, Universität Heidelberg

² Max-Planck-Institut für Plasmaphysik

³ Institute for Computational Mechanics, Technische Universität München

January 27, 2016

SLIDE 2

Outline

◮ ExaDG: Operator evaluation framework
  ◮ Matrix-free algorithm
◮ Parallel adaptive multigrid
◮ Applications
◮ Summary

SLIDE 3

ExaDG: Motivation of project

◮ Modern PDE simulators spend about 60–95 % of CPU time on matrix-vector products; the unstructured/adaptive case is typically done with sparse matrices
◮ Bottleneck: memory-limited, only 1–5 % of arithmetic peak
◮ Matrix-free alternative based on cellwise integration¹
  ◮ (much) lower memory transfer
  ◮ CPUs: 30–70 % of arithmetic peak
  ◮ 2–5 times as fast for Q2 elements on a petascale machine (SuperMUC)
◮ Medium/high order 2 ≤ p ≤ 6 favors local computations and the cache hierarchy rather than data transfer

[Plot: time [s] per DoF vs. element degree (1–8) for SpMV (usual, static condensation) and matrix-free (general, Cartesian); Laplacian on continuous elements, CPU: Xeon E5 (SNB @ SuperMUC)]

Matrix-free implementation of general-purpose PDE operator evaluation is promising for the exascale

¹ Kronbichler, Kormann, Comput. Fluids, 2012

SLIDE 4

Generic operator evaluation framework

◮ Integrated in the deal.II finite element library (www.dealii.org)
◮ Target: generic weak forms on quadrilaterals, input by user:
  ◮ operator in one quadrature point → C++ templated function (see the sketch below)
◮ Mesh loop according to the specific algorithm structure (operator evaluation on active cells, evaluation on geometric multigrid levels, multigrid smoothing on cell patches)
◮ Library should handle loops in a hardware-specific way
  ◮ loop over quadrature points and cells/faces/vertices
◮ Full support for mesh adaptivity with hanging nodes (L2-, Hdiv-, Hcurl-, H1-conforming elements)
◮ Parallelization: MPI (domain decomposition), threads
◮ Vectorization
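To illustrate what "operator in one quadrature point → C++ templated function" looks like in practice, here is a minimal sketch of a matrix-free Laplace cell kernel in the deal.II style. FEEvaluation and its calls are from the library; the dimension, degree, and the function name local_apply_laplace are placeholder choices, and the exact evaluate/integrate flag interface varies between deal.II releases.

```cpp
#include <deal.II/matrix_free/matrix_free.h>
#include <deal.II/matrix_free/fe_evaluation.h>
#include <deal.II/lac/la_parallel_vector.h>

// Sketch of a cell kernel for the Laplacian: sum factorization is hidden inside
// FEEvaluation, and several cells are processed at once through vectorization.
template <int dim, int fe_degree>
void local_apply_laplace(
  const dealii::MatrixFree<dim, double>                     &data,
  dealii::LinearAlgebra::distributed::Vector<double>        &dst,
  const dealii::LinearAlgebra::distributed::Vector<double>  &src,
  const std::pair<unsigned int, unsigned int>               &cell_range)
{
  dealii::FEEvaluation<dim, fe_degree> phi(data);
  for (unsigned int cell = cell_range.first; cell < cell_range.second; ++cell)
    {
      phi.reinit(cell);
      phi.read_dof_values(src);   // u_K = P_K u (resolving hanging-node constraints)
      phi.evaluate(false, true);  // evaluate gradients at quadrature points
      for (unsigned int q = 0; q < phi.n_q_points; ++q)
        phi.submit_gradient(phi.get_gradient(q), q); // operation in one quadrature point
      phi.integrate(false, true); // test by gradients of the basis functions
      phi.distribute_local_to_global(dst);           // v += P_K^T v_K
    }
}
// Such a kernel is typically handed to MatrixFree::cell_loop(), which schedules
// the cell ranges over MPI ranks and threads.
```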

SLIDE 5

Infrastructure for parallel meshes in deal.II

◮ deal.II implements quad/hex-only meshes, forest-of-trees concept
◮ deal.II integrates the functionality of the p4est library for dynamically adapted meshes based on hierarchical refinement into trees (with hanging nodes)²
◮ Each processor has its own view of the mesh (refined from the coarse mesh)³
◮ Scaling results in deal.II to 65,536 cores (p4est has been scaled to 1.5m MPI ranks)

[Figure: adaptively refined forest-of-trees mesh partitioned among processors P0–P3]

² www.p4est.org

³ W. Bangerth, C. Burstedde, T. Heister, M. Kronbichler, ACM TOMS 38(2), 2011
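As a concrete (if minimal) illustration of this setup, a p4est-backed distributed mesh in deal.II is created roughly as follows; the geometry, refinement depth, and the refinement criterion below are placeholders.

```cpp
#include <deal.II/base/mpi.h>
#include <deal.II/distributed/tria.h>
#include <deal.II/grid/grid_generator.h>

int main(int argc, char **argv)
{
  dealii::Utilities::MPI::MPI_InitFinalize mpi_init(argc, argv, 1);

  // Each MPI rank holds its own view of the mesh: the shared coarse mesh plus
  // the locally owned part of the hierarchically refined forest of trees (p4est).
  dealii::parallel::distributed::Triangulation<3> triangulation(MPI_COMM_WORLD);
  dealii::GridGenerator::hyper_cube(triangulation);
  triangulation.refine_global(5);

  // Dynamic adaptation with hanging nodes: flag cells by some error indicator
  // (placeholder criterion here), then refine and let p4est repartition.
  for (const auto &cell : triangulation.active_cell_iterators())
    if (cell->is_locally_owned() && cell->center().norm() < 0.25)
      cell->set_refine_flag();
  triangulation.execute_coarsening_and_refinement();
}
```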

SLIDE 6

Faster matrix-vector products for high order elements

matrix-based:  A = Σ_{K ∈ {cells}} P_K^T A_K P_K  (assembly),  then  v = A u

matrix-free:   v = Σ_{K ∈ {cells}} P_K^T A_K (P_K u)

Matrix-free algorithm for the matrix-vector product v = A u (see the sketch below):
◮ v = 0
◮ loop over cells
  (i) Extract local vector values on cell: u_K = P_K u
  (ii) Apply operation locally on cell: v_K = A_K u_K (without forming A_K; sum factorization, like EXA-DUNE)
  (iii) Sum results from (ii) into the global solution vector: v = v + P_K^T v_K
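Step (ii) is where sum factorization comes in: instead of multiplying by the dense local matrix A_K, the tensor-product structure of the basis is exploited by applying small 1D matrices one coordinate direction at a time. The sketch below (illustrative only: 2D, values-at-quadrature-points part only; the names n_p, n_q, S are placeholders) shows the structure that reduces the cost per element from O((p+1)^{2d}) to O(d (p+1)^{d+1}).

```cpp
#include <cstddef>

// Sum-factorization sketch: interpolate nodal values of one 2D tensor-product
// element to its quadrature points by applying the 1D matrix S (n_q x n_p)
// first in x, then in y, instead of forming the full (n_q^2 x n_p^2) matrix.
template <std::size_t n_p, std::size_t n_q>
void values_at_quadrature_points(const double (&S)[n_q][n_p],
                                 const double (&u)[n_p][n_p], // nodal values u(ix, iy)
                                 double (&u_q)[n_q][n_q])     // values at quadrature points
{
  double tmp[n_q][n_p] = {}; // intermediate: interpolated in x, still nodal in y

  // x-direction: tmp(qx, iy) = sum_ix S(qx, ix) * u(ix, iy)
  for (std::size_t qx = 0; qx < n_q; ++qx)
    for (std::size_t iy = 0; iy < n_p; ++iy)
      for (std::size_t ix = 0; ix < n_p; ++ix)
        tmp[qx][iy] += S[qx][ix] * u[ix][iy];

  // y-direction: u_q(qx, qy) = sum_iy S(qy, iy) * tmp(qx, iy)
  for (std::size_t qx = 0; qx < n_q; ++qx)
    for (std::size_t qy = 0; qy < n_q; ++qy)
      {
        double sum = 0.;
        for (std::size_t iy = 0; iy < n_p; ++iy)
          sum += S[qy][iy] * tmp[qx][iy];
        u_q[qx][qy] = sum;
      }
}
```

Derivatives and the quadrature-weight loop follow the same pattern with different 1D matrices; the "even-odd decomposition" mentioned on the next slide exploits the symmetry of these 1D matrices to roughly halve their cost.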

SLIDE 7

Characterization: matrix-free vs. sparse matrices

Operation count per degree of freedom (arithmetic)

[Plot: arithmetic operations per DoF vs. element degree (1–8) for SpMV, matrix-free, matrix-free GL]

Memory per degree of freedom (≈ main memory transfer)

[Plot: memory [bytes] per DoF vs. element degree (1–8) for SpMV, matrix-free general, matrix-free Cartesian]

3D Laplacian, hexahedral elements Qp, periodic b.c. (= no boundary), pre-evaluated Jacobians on all quadrature points, "even-odd decomposition" to decrease work for the 1D kernels from (p + 1)² to (p + 1)(p/2 + 2)

Sparse matrix-vector product is memory bandwidth bound → speedup for Q2 and Q3: trade memory transfer for arithmetic

SLIDE 8

Performance: continuous elements

3D Laplacian, Qp elements, Cartesian & curved geometry; Intel Xeon E5 Sandy Bridge 2.7 GHz (SuperMUC); time per degree of freedom (DoF) per core (but node fully loaded)

[Plots: arithmetic operations per DoF + memory per DoF = time per DoF, each vs. element degree (1–8), comparing SpMV (usual, static condensation) and matrix-free (general, Cartesian, GL)]

SLIDE 9

Performance: discontinuous Galerkin

3D Laplacian, Qp elements, symmetric interior penalty DG, Cartesian & curved geometry; Intel Xeon E5 Sandy Bridge 2.7 GHz (SuperMUC); time per degree of freedom per core (but node fully loaded)

[Plot: time [s] per DoF vs. element degree (1–8) for SpMV (usual) and matrix-free (curved, Cartesian)]

SLIDE 10

Characteristics of element kernels

Operator evaluation: 30–70 % of arithmetic peak

◮ Vectorization over elements
  ◮ data layout: array-of-struct-of-array
  ◮ intrinsics via C++ template class VectorizedArray<double> (see the sketch below)
  ◮ everything but vector read/write fully vectorized (DG: fully vectorized)
◮ Loop bounds known at compile time
  ◮ element degree and number of 1D quadrature points as templates
  ◮ unroll loops of length ⌊p/2⌋ + 1 or p + 1
◮ Challenge: mix of operations
  ◮ memory intensive: vector read/write, pre-evaluated Jacobians, coefficients, etc. on quadrature points
  ◮ arithmetic intensive: sum factorization

→ Develop performance model!

Assembly excerpt (AVX/FMA instructions from the compiled element kernel):

  ...
  vmovapd 0x988(%rsp),%ymm4
  vmovapd 0xbc8(%rsp),%ymm6
  vmovapd 0x9a8(%rsp),%ymm14
  vsubpd %ymm4,%ymm6,%ymm11
  vmulpd %ymm15,%ymm6,%ymm3
  vmulpd %ymm6,%ymm13,%ymm8
  vfmadd231pd %ymm13,%ymm4,%ymm3
  vfmadd231pd %ymm4,%ymm15,%ymm8
  vmulpd %ymm6,%ymm12,%ymm10
  vmovapd 0xbe8(%rsp),%ymm0
  vmovapd 0x268(%rsp),%ymm9
  vfmsub231pd %ymm9,%ymm4,%ymm10
  vmulpd %ymm9,%ymm6,%ymm9
  vmovapd 0xaa8(%rsp),%ymm2
  vfmsub231pd %ymm4,%ymm12,%ymm9
  vmovapd %ymm11,0x748(%rsp)
  vmovapd 0xac8(%rsp),%ymm1
  vmulpd %ymm15,%ymm0,%ymm5
  vmulpd %ymm0,%ymm13,%ymm4
  vfmadd231pd %ymm7,%ymm2,%ymm3
  vfmadd231pd %ymm7,%ymm2,%ymm8
  vfmadd231pd %ymm13,%ymm14,%ymm5
  vmovapd %ymm3,0x2c8(%rsp)
  vsubpd %ymm14,%ymm0,%ymm3
  ...
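To make the vectorization-over-elements point concrete, here is a small, hedged sketch around deal.II's VectorizedArray<double>; the class and its operators are from the library, while the per-lane values are placeholders and the lane-count accessor has changed names across releases.

```cpp
#include <deal.II/base/vectorization.h>

// One VectorizedArray<double> holds the same quantity for a whole batch of
// elements (e.g. 4 with AVX), so the scalar-looking arithmetic below compiles
// to packed instructions such as the vmulpd/vfmadd231pd shown above.
void scale_gradient_batch()
{
  using VA = dealii::VectorizedArray<double>;

  VA jacobian_weight, gradient; // one lane per element in the batch
  for (unsigned int lane = 0; lane < VA::size(); ++lane) // older releases: n_array_elements
    {
      jacobian_weight[lane] = 0.5 + 0.1 * lane; // placeholder per-element geometry data
      gradient[lane]        = 1.0 + lane;
    }

  const VA scaled = jacobian_weight * gradient; // one packed multiply for all elements
  (void)scaled;
}
```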

SLIDE 11

Project goal: Parallel adaptive multigrid

Implicit solvers for large-scale problems

◮ Many established solvers need more than just operator evaluation (mat-vec)
◮ Matrix-free evaluation requires specific preconditioners
◮ Simple example: geometric multigrid with polynomial (Chebyshev) smoother

SLIDE 12

Project goal: Parallel adaptive multigrid

Implicit solvers for large-scale problems

◮ Many established solvers need more than just operator evaluation (mat-vec)
◮ Matrix-free evaluation requires specific preconditioners
◮ Simple example: geometric multigrid with polynomial (Chebyshev) smoother (a setup sketch follows below)
◮ Goal: GMG with overlapping Schwarz smoothers for high-order bases
◮ Schwarz smoothers approximately invert the differential operator on patches of elements by tensorial techniques
◮ Integrate with MPI and domain decomposition
◮ Combine with algebraic multigrid for coarse meshes with millions of elements, using low-order bases
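For the "simple example" above, a matrix-free Chebyshev smoother can be set up along the following lines. This is a sketch assuming a level operator class in the style of deal.II's MatrixFreeOperators/step-37; PreconditionChebyshev and its AdditionalData fields are from the library, while the LevelOperator type and the helper name are placeholders.

```cpp
#include <deal.II/lac/la_parallel_vector.h>
#include <deal.II/lac/precondition.h>

#include <memory>

// Sketch: Chebyshev smoother of degree 5 wrapped around a matrix-free level
// operator; only operator evaluations and a diagonal estimate are needed,
// no assembled matrix.
template <typename LevelOperator>
auto make_chebyshev_smoother(const LevelOperator &level_operator)
{
  using VectorType = dealii::LinearAlgebra::distributed::Vector<double>;
  using Smoother   = dealii::PreconditionChebyshev<LevelOperator, VectorType>;

  typename Smoother::AdditionalData data;
  data.degree          = 5;   // polynomial degree of the smoother
  data.smoothing_range = 15.; // the "alpha = 15" used in the scaling studies below
  // Diagonal of the operator as inner preconditioner; the accessor is the one
  // provided by deal.II's MatrixFreeOperators::Base (an assumption on LevelOperator).
  data.preconditioner  = level_operator.get_matrix_diagonal_inverse();

  auto smoother = std::make_shared<Smoother>();
  smoother->initialize(level_operator, data);
  return smoother; // smoother->vmult(dst, src) applies one smoothing sweep
}
```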

SLIDE 13

Geometric multigrid methods in deal.II

◮ Geometric multigrid on adaptively refined meshes with local smoothing⁴,⁵
◮ Recent development (end of 2015): fully parallel computations in 2D & 3D on both continuous and discontinuous elements⁶

[Figure: local smoothing on an adaptively refined mesh, AMG/GMG and GMG level hierarchy. Blue: smoothed on fine level; red: additional residual contribution in level transfer]

⁴ DG: G. Kanschat, Comput. Struct. 82(28):2437–2445, 2004
⁵ FEM: B. Janssen, G. Kanschat, SIAM J. Sci. Comput. 33(4):2095–2114, 2011
⁶ Contributors: T. Heister, G. Kanschat, M. Kronbichler

SLIDE 14

Scaling to large processor counts: Uniform meshes

◮ Continuous finite/spectral elements
◮ Constant-coefficient Poisson equation
◮ Q3 basis
◮ CG solver preconditioned by geometric multigrid
◮ Chebyshev smoother (degree 5, α = 15)
◮ ∼ 6 iterations for tolerance 10⁻¹²

Computations on SuperMUC Phase 1 (SandyB) and SuperMUC Phase 2 (Haswell)

Strong scaling (constant global problem size along lines)

[Plot: Poisson solver time [s] vs. number of cores (64–65,536); series: SandyB with 128M/16M/2M elements, Haswell with 16M/2M elements, linear-scaling reference]

mesh: 128³ … 512³; approximation: Q3 elements; FEM degrees of freedom: 57.1m … 3.63b

SLIDE 15

Scaling to large processor counts: Uniform meshes

◮ Continuous finite/spectral elements (FEM), CG solver preconditioned by geometric multigrid with Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 6 iterations)
◮ Interior penalty discontinuous Galerkin method (DG SIP), geometric multigrid, Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 7 iterations)

Weak scaling (constant size per proc)

[Plot: throughput in DoFs / (core · s) vs. number of cores (16–65,536); series: FE Haswell, FE SandyB, DG Haswell, DG SandyB]

Constant-coefficient Poisson equation; uniform meshes: 48³ … 768³; approximation: Q3 elements; FEM degrees of freedom: 3.05m … 12.2b; DG degrees of freedom: 7.08m … 29.0b

Computations on SuperMUC Phase 1 (SandyB, 16 cores/node) and SuperMUC Phase 2 (Haswell, 28 cores/node)

SLIDE 16

Scaling to large processor counts: Uniform meshes

◮ Continuous finite/spectral elements (FEM), CG solver preconditioned by geometric multigrid with Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 7 iterations)
◮ Interior penalty discontinuous Galerkin method (DG SIP), geometric multigrid, Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 7 iterations)

Weak scaling (constant size per proc)

[Plot: linear solver time [s] vs. number of cores (28–14,336); series: DG SIP Q4, FEM Q4]

Constant-coefficient Poisson equation; uniform meshes: 48³ … 384³; approximation: Q4 elements; FEM degrees of freedom: 7.19m … 3.64b; DG degrees of freedom: 13.8m … 7.08b

Computations on SuperMUC Phase 2 (Haswell, 28 cores/node)

SLIDE 17

Solver time versus problem size

Continuous finite elements/spectral elements (FEM), Q3 basis, CG solver preconditioned by geometric multigrid with Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 6 iterations)

Computations on SuperMUC Phase 1 (SandyB, 16 cores/node)

[Plot: Poisson solver time [s] vs. problem size (10⁵–10¹² DoFs); series: SandyB with 1024, 8192, and 65,536 cores]

Computations run until memory per core exhausted

◮ Approx. 4 million DoFs in 1.5 GB (rough estimate below)
◮ GMG + CG + adaptive machinery not particularly optimized for memory usage; local size could probably be doubled
◮ Matrix-based solver for Q2 exhausts memory for ∼ 500k DoFs
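A rough back-of-the-envelope check of the first bullet (my arithmetic, not from the slides):

1.5 GB / 4 × 10⁶ DoFs ≈ 375 bytes per DoF

This includes solution and temporary vectors, the multigrid level hierarchy, and mesh/adaptivity data on top of the matrix-free operator data itself.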

SLIDE 18

Strong scaling: Adaptive vs. uniform mesh

Continuous finite elements, Poisson equation, Q2 approximation in 3D

Slowdown for adaptive case (approx. factor 5):

◮ Twice as many iterations (10 vs. 5)
◮ Load imbalance on level smoothers for adapted meshes (strategy: assign level cells according to owner of first child cell)

Constant global problem size

[Plot: linear solver time [s] vs. number of cores (56–14,336); series: FEM Q2 uniform/adaptive with 2m cells and with 17m cells]

mesh: 2.1m and 16.8m cells; adaptive mesh: 3 more levels than uniform, ∼ 10 % hanging nodes on final mesh

SLIDE 19

Weak scaling: Adaptive vs. uniform mesh

Continuous finite elements, Q2 approximation in 3D

Slowdown for adaptive case:

◮ Twice as many iterations (10 vs. 5)
◮ Load imbalance on level smoothers for adapted meshes (strategy: assign level cells according to owner of first child cell)

Constant size per proc

[Plot: linear solver time [s] vs. number of cores (28–14,336); series: FEM Q2 uniform, FEM Q2 adaptive]

degrees of freedom: 7.19m … 3.64b; adaptive mesh: 3 more levels than uniform, ∼ 15 % … 6 % hanging nodes on final mesh

SLIDE 20

Project goal: Benchmarking on engineering problems

Develop widely applicable methods with added value for realistic extreme-scale simulations in engineering

Application 1: Incompressible flow, simulation of turbulent flow, both splitting methods and monolithic solvers

Application 2: Fluid–structure interaction, monolithic solver, deforming mesh (ALE), large coarse grids (e.g. simulation of the human lung)

… and some more in interaction with the community

SLIDE 21

Summary

◮ Build on the very fast vector assembly framework in deal.II → fits linear and nonlinear operators
◮ Extend algorithms beyond those that spend most time on matrix-vector products
◮ Relatively low overhead for adaptivity and more unstructured meshes
◮ Improve data locality/cache usage of local kernels
◮ Target throughput architectures
◮ Reduce bottlenecks in large-scale (MPI) applications

SLIDE 22

Try to substantiate these questions

◮ Is higher order effective in realistic engineering settings?
◮ Is exascale possible for realistic engineering settings with complex meshes and complex physics?
◮ Can coupled/implicit solvers be realized in an exascale setting?

… add data points to ongoing discussions in application areas!
