SLIDE 1


ExaDG: High-order discontinuous Galerkin for the exa-scale

Guido Kanschat¹, Katharina Kormann², Martin Kronbichler³, Wolfgang A. Wall³

¹ Interdisziplinäres Institut für Wissenschaftliches Rechnen, Universität Heidelberg

² Max-Planck-Institut für Plasmaphysik

³ Institute for Computational Mechanics, Technische Universität München

January 27, 2016

SLIDE 2

Outline

◮ ExaDG: Operator evaluation framework
  ◮ Matrix-free algorithm
◮ Parallel adaptive multigrid
◮ Applications
◮ Summary

SLIDE 3

ExaDG: Motivation of project

◮ Modern PDE simulators spend about 60–95 % of CPU time on matrix-vector products; the unstructured/adaptive case is typically done with sparse matrices
◮ Bottleneck: memory-limited, only 1–5 % of arithmetic peak
◮ Matrix-free alternative based on cellwise integration¹
  ◮ (much) lower memory transfer
  ◮ CPUs: 30–70 % of arithmetic peak
  ◮ 2–5 times as fast for Q2 elements on a petascale machine (SuperMUC)
◮ Medium/high order 2 ≤ p ≤ 6 favors local computations and the cache hierarchy rather than data transfer

[Plot: time [s] per DoF vs. element degree (1–8) for SpMV (usual, static condensation) and matrix-free (general, Cartesian); Laplacian on continuous elements, CPU: Xeon E5 (SNB @ SuperMUC)]

Matrix-free implementation of general-purpose PDE operator evaluation is promising for the exascale

¹ Kronbichler, Kormann, Comput. Fluids, 2012

SLIDE 4

Generic operator evaluation framework

◮ Integrated in the deal.II finite element library (www.dealii.org)
◮ Target: generic weak forms on quadrilaterals, input by user:
  ◮ operator in one quadrature point → C++ templated function (see the sketch below)
◮ Mesh loop according to the specific algorithm structure (operator evaluation on active cells, evaluation on geometric multigrid levels, multigrid smoothing on cell patches)
◮ Library should handle loops in a hardware-specific way
  ◮ loop over quadrature points and cells/faces/vertices
◮ Full support for mesh adaptivity with hanging nodes (L2-, Hdiv-, Hcurl-, H1-conforming elements)
◮ Parallelization: MPI (domain decomposition), threads
◮ Vectorization
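To illustrate what "operator in one quadrature point → C++ templated function" looks like in practice, here is a minimal sketch of a matrix-free Laplace cell kernel in the deal.II style. FEEvaluation and its calls are from the library; the dimension, degree, and the function name local_apply_laplace are placeholder choices, and the exact evaluate/integrate flag interface varies between deal.II releases.

```cpp
#include <deal.II/matrix_free/matrix_free.h>
#include <deal.II/matrix_free/fe_evaluation.h>
#include <deal.II/lac/la_parallel_vector.h>

// Sketch of a cell kernel for the Laplacian: sum factorization is hidden inside
// FEEvaluation, and several cells are processed at once through vectorization.
template <int dim, int fe_degree>
void local_apply_laplace(
  const dealii::MatrixFree<dim, double>                     &data,
  dealii::LinearAlgebra::distributed::Vector<double>        &dst,
  const dealii::LinearAlgebra::distributed::Vector<double>  &src,
  const std::pair<unsigned int, unsigned int>               &cell_range)
{
  dealii::FEEvaluation<dim, fe_degree> phi(data);
  for (unsigned int cell = cell_range.first; cell < cell_range.second; ++cell)
    {
      phi.reinit(cell);
      phi.read_dof_values(src);   // u_K = P_K u (resolving hanging-node constraints)
      phi.evaluate(false, true);  // evaluate gradients at quadrature points
      for (unsigned int q = 0; q < phi.n_q_points; ++q)
        phi.submit_gradient(phi.get_gradient(q), q); // operation in one quadrature point
      phi.integrate(false, true); // test by gradients of the basis functions
      phi.distribute_local_to_global(dst);           // v += P_K^T v_K
    }
}
// Such a kernel is typically handed to MatrixFree::cell_loop(), which schedules
// the cell ranges over MPI ranks and threads.
```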

SLIDE 5

Infrastructure for parallel meshes in deal.II

◮ deal.II implements quad/hex-only meshes, forest-of-trees concept
◮ deal.II integrates the functionality of the p4est library for dynamically adapted meshes based on hierarchical refinement into trees (with hanging nodes)²
◮ Each processor has its own view of the mesh (refined from the coarse mesh)³
◮ Scaling results in deal.II to 65,536 cores (p4est has been scaled to 1.5m MPI ranks)

[Figure: adaptively refined forest-of-trees mesh partitioned among processors P0–P3]

² www.p4est.org

³ W. Bangerth, C. Burstedde, T. Heister, M. Kronbichler, ACM TOMS 38(2), 2011
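As a concrete (if minimal) illustration of this setup, a p4est-backed distributed mesh in deal.II is created roughly as follows; the geometry, refinement depth, and the refinement criterion below are placeholders.

```cpp
#include <deal.II/base/mpi.h>
#include <deal.II/distributed/tria.h>
#include <deal.II/grid/grid_generator.h>

int main(int argc, char **argv)
{
  dealii::Utilities::MPI::MPI_InitFinalize mpi_init(argc, argv, 1);

  // Each MPI rank holds its own view of the mesh: the shared coarse mesh plus
  // the locally owned part of the hierarchically refined forest of trees (p4est).
  dealii::parallel::distributed::Triangulation<3> triangulation(MPI_COMM_WORLD);
  dealii::GridGenerator::hyper_cube(triangulation);
  triangulation.refine_global(5);

  // Dynamic adaptation with hanging nodes: flag cells by some error indicator
  // (placeholder criterion here), then refine and let p4est repartition.
  for (const auto &cell : triangulation.active_cell_iterators())
    if (cell->is_locally_owned() && cell->center().norm() < 0.25)
      cell->set_refine_flag();
  triangulation.execute_coarsening_and_refinement();
}
```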

SLIDE 6

Faster matrix-vector products for high order elements

matrix-based:  A = Σ_{K ∈ {cells}} P_K^T A_K P_K  (assembly),  then  v = A u

matrix-free:   v = Σ_{K ∈ {cells}} P_K^T A_K (P_K u)

Matrix-free algorithm for the matrix-vector product v = A u (see the sketch below):
◮ v = 0
◮ loop over cells
  (i) Extract local vector values on cell: u_K = P_K u
  (ii) Apply operation locally on cell: v_K = A_K u_K (without forming A_K; sum factorization, like EXA-DUNE)
  (iii) Sum results from (ii) into the global solution vector: v = v + P_K^T v_K
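Step (ii) is where sum factorization comes in: instead of multiplying by the dense local matrix A_K, the tensor-product structure of the basis is exploited by applying small 1D matrices one coordinate direction at a time. The sketch below (illustrative only: 2D, values-at-quadrature-points part only; the names n_p, n_q, S are placeholders) shows the structure that reduces the cost per element from O((p+1)^{2d}) to O(d (p+1)^{d+1}).

```cpp
#include <cstddef>

// Sum-factorization sketch: interpolate nodal values of one 2D tensor-product
// element to its quadrature points by applying the 1D matrix S (n_q x n_p)
// first in x, then in y, instead of forming the full (n_q^2 x n_p^2) matrix.
template <std::size_t n_p, std::size_t n_q>
void values_at_quadrature_points(const double (&S)[n_q][n_p],
                                 const double (&u)[n_p][n_p], // nodal values u(ix, iy)
                                 double (&u_q)[n_q][n_q])     // values at quadrature points
{
  double tmp[n_q][n_p] = {}; // intermediate: interpolated in x, still nodal in y

  // x-direction: tmp(qx, iy) = sum_ix S(qx, ix) * u(ix, iy)
  for (std::size_t qx = 0; qx < n_q; ++qx)
    for (std::size_t iy = 0; iy < n_p; ++iy)
      for (std::size_t ix = 0; ix < n_p; ++ix)
        tmp[qx][iy] += S[qx][ix] * u[ix][iy];

  // y-direction: u_q(qx, qy) = sum_iy S(qy, iy) * tmp(qx, iy)
  for (std::size_t qx = 0; qx < n_q; ++qx)
    for (std::size_t qy = 0; qy < n_q; ++qy)
      {
        double sum = 0.;
        for (std::size_t iy = 0; iy < n_p; ++iy)
          sum += S[qy][iy] * tmp[qx][iy];
        u_q[qx][qy] = sum;
      }
}
```

Derivatives and the quadrature-weight loop follow the same pattern with different 1D matrices; the "even-odd decomposition" mentioned on the next slide exploits the symmetry of these 1D matrices to roughly halve their cost.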

SLIDE 7

Characterization: matrix-free vs. sparse matrices

Operation count per degree of freedom (arithmetic)

[Plot: arithmetic operations per DoF vs. element degree (1–8) for SpMV, matrix-free, matrix-free GL]

Memory per degree of freedom (≈ main memory transfer)

[Plot: memory [bytes] per DoF vs. element degree (1–8) for SpMV, matrix-free general, matrix-free Cartesian]

3D Laplacian, hexahedral elements Qp, periodic b.c. (= no boundary), pre-evaluated Jacobians on all quadrature points, "even-odd decomposition" to decrease work for the 1D kernels from (p + 1)² to (p + 1)(p/2 + 2)

Sparse matrix-vector product is memory bandwidth bound → speedup for Q2 and Q3: trade memory transfer for arithmetic

SLIDE 8

Performance: continuous elements

3D Laplacian, Qp elements, Cartesian & curved geometry; Intel Xeon E5 Sandy Bridge 2.7 GHz (SuperMUC); time per degree of freedom (DoF) per core (but node fully loaded)

[Plots: arithmetic operations per DoF + memory per DoF = time per DoF, each vs. element degree (1–8), comparing SpMV (usual, static condensation) and matrix-free (general, Cartesian, GL)]

SLIDE 9

Performance: discontinuous Galerkin

3D Laplacian, Qp elements, symmetric interior penalty DG, Cartesian & curved geometry; Intel Xeon E5 Sandy Bridge 2.7 GHz (SuperMUC); time per degree of freedom per core (but node fully loaded)

[Plot: time [s] per DoF vs. element degree (1–8) for SpMV (usual) and matrix-free (curved, Cartesian)]

SLIDE 10

Characteristics of element kernels

Operator evaluation: 30–70 % of arithmetic peak

◮ Vectorization over elements
  ◮ data layout: array-of-struct-of-array
  ◮ intrinsics via C++ template class VectorizedArray<double> (see the sketch below)
  ◮ everything but vector read/write fully vectorized (DG: fully vectorized)
◮ Loop bounds known at compile time
  ◮ element degree and number of 1D quadrature points as templates
  ◮ unroll loops of length ⌊p/2⌋ + 1 or p + 1
◮ Challenge: mix of operations
  ◮ memory intensive: vector read/write, pre-evaluated Jacobians, coefficients, etc. on quadrature points
  ◮ arithmetic intensive: sum factorization

→ Develop performance model!

Assembly excerpt (AVX/FMA instructions from the compiled element kernel):

  ...
  vmovapd 0x988(%rsp),%ymm4
  vmovapd 0xbc8(%rsp),%ymm6
  vmovapd 0x9a8(%rsp),%ymm14
  vsubpd %ymm4,%ymm6,%ymm11
  vmulpd %ymm15,%ymm6,%ymm3
  vmulpd %ymm6,%ymm13,%ymm8
  vfmadd231pd %ymm13,%ymm4,%ymm3
  vfmadd231pd %ymm4,%ymm15,%ymm8
  vmulpd %ymm6,%ymm12,%ymm10
  vmovapd 0xbe8(%rsp),%ymm0
  vmovapd 0x268(%rsp),%ymm9
  vfmsub231pd %ymm9,%ymm4,%ymm10
  vmulpd %ymm9,%ymm6,%ymm9
  vmovapd 0xaa8(%rsp),%ymm2
  vfmsub231pd %ymm4,%ymm12,%ymm9
  vmovapd %ymm11,0x748(%rsp)
  vmovapd 0xac8(%rsp),%ymm1
  vmulpd %ymm15,%ymm0,%ymm5
  vmulpd %ymm0,%ymm13,%ymm4
  vfmadd231pd %ymm7,%ymm2,%ymm3
  vfmadd231pd %ymm7,%ymm2,%ymm8
  vfmadd231pd %ymm13,%ymm14,%ymm5
  vmovapd %ymm3,0x2c8(%rsp)
  vsubpd %ymm14,%ymm0,%ymm3
  ...
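To make the vectorization-over-elements point concrete, here is a small, hedged sketch around deal.II's VectorizedArray<double>; the class and its operators are from the library, while the per-lane values are placeholders and the lane-count accessor has changed names across releases.

```cpp
#include <deal.II/base/vectorization.h>

// One VectorizedArray<double> holds the same quantity for a whole batch of
// elements (e.g. 4 with AVX), so the scalar-looking arithmetic below compiles
// to packed instructions such as the vmulpd/vfmadd231pd shown above.
void scale_gradient_batch()
{
  using VA = dealii::VectorizedArray<double>;

  VA jacobian_weight, gradient; // one lane per element in the batch
  for (unsigned int lane = 0; lane < VA::size(); ++lane) // older releases: n_array_elements
    {
      jacobian_weight[lane] = 0.5 + 0.1 * lane; // placeholder per-element geometry data
      gradient[lane]        = 1.0 + lane;
    }

  const VA scaled = jacobian_weight * gradient; // one packed multiply for all elements
  (void)scaled;
}
```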

SLIDE 11

Project goal: Parallel adaptive multigrid

Implicit solvers for large-scale problems

◮ Many established solvers need more than just operator evaluation (mat-vec)
◮ Matrix-free evaluation requires specific preconditioners
◮ Simple example: geometric multigrid with polynomial (Chebyshev) smoother

SLIDE 12

Project goal: Parallel adaptive multigrid

Implicit solvers for large-scale problems

◮ Many established solvers need more than just operator evaluation (mat-vec)
◮ Matrix-free evaluation requires specific preconditioners
◮ Simple example: geometric multigrid with polynomial (Chebyshev) smoother (a setup sketch follows below)
◮ Goal: GMG with overlapping Schwarz smoothers for high-order bases
◮ Schwarz smoothers approximately invert the differential operator on patches of elements by tensorial techniques
◮ Integrate with MPI and domain decomposition
◮ Combine with algebraic multigrid for coarse meshes with millions of elements, using low-order bases
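For the "simple example" above, a matrix-free Chebyshev smoother can be set up along the following lines. This is a sketch assuming a level operator class in the style of deal.II's MatrixFreeOperators/step-37; PreconditionChebyshev and its AdditionalData fields are from the library, while the LevelOperator type and the helper name are placeholders.

```cpp
#include <deal.II/lac/la_parallel_vector.h>
#include <deal.II/lac/precondition.h>

#include <memory>

// Sketch: Chebyshev smoother of degree 5 wrapped around a matrix-free level
// operator; only operator evaluations and a diagonal estimate are needed,
// no assembled matrix.
template <typename LevelOperator>
auto make_chebyshev_smoother(const LevelOperator &level_operator)
{
  using VectorType = dealii::LinearAlgebra::distributed::Vector<double>;
  using Smoother   = dealii::PreconditionChebyshev<LevelOperator, VectorType>;

  typename Smoother::AdditionalData data;
  data.degree          = 5;   // polynomial degree of the smoother
  data.smoothing_range = 15.; // the "alpha = 15" used in the scaling studies below
  // Diagonal of the operator as inner preconditioner; the accessor is the one
  // provided by deal.II's MatrixFreeOperators::Base (an assumption on LevelOperator).
  data.preconditioner  = level_operator.get_matrix_diagonal_inverse();

  auto smoother = std::make_shared<Smoother>();
  smoother->initialize(level_operator, data);
  return smoother; // smoother->vmult(dst, src) applies one smoothing sweep
}
```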

SLIDE 13

Geometric multigrid methods in deal.II

◮ Geometric multigrid on adaptively refined meshes with local smoothing⁴,⁵
◮ Recent development (end of 2015): fully parallel computations in 2D & 3D on both continuous and discontinuous elements⁶

[Figure: local smoothing on an adaptively refined mesh, AMG/GMG and GMG level hierarchy. Blue: smoothed on fine level; red: additional residual contribution in level transfer]

⁴ DG: G. Kanschat, Comput. Struct. 82(28):2437–2445, 2004
⁵ FEM: B. Janssen, G. Kanschat, SIAM J. Sci. Comput. 33(4):2095–2114, 2011
⁶ Contributors: T. Heister, G. Kanschat, M. Kronbichler

SLIDE 14

Scaling to large processor counts: Uniform meshes

◮ Continuous finite/spectral elements
◮ Constant-coefficient Poisson equation
◮ Q3 basis
◮ CG solver preconditioned by geometric multigrid
◮ Chebyshev smoother (degree 5, α = 15)
◮ ∼ 6 iterations for tolerance 10⁻¹²

Computations on SuperMUC Phase 1 (SandyB) and SuperMUC Phase 2 (Haswell)

Strong scaling (constant global problem size along lines)

[Plot: Poisson solver time [s] vs. number of cores (64–65,536); series: SandyB with 128M/16M/2M elements, Haswell with 16M/2M elements, linear-scaling reference]

mesh: 128³ … 512³; approximation: Q3 elements; FEM degrees of freedom: 57.1m … 3.63b

SLIDE 15

Scaling to large processor counts: Uniform meshes

◮ Continuous finite/spectral elements (FEM), CG solver preconditioned by geometric multigrid with Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 6 iterations)
◮ Interior penalty discontinuous Galerkin method (DG SIP), geometric multigrid, Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 7 iterations)

Weak scaling (constant size per proc)

[Plot: throughput in DoFs / (core · s) vs. number of cores (16–65,536); series: FE Haswell, FE SandyB, DG Haswell, DG SandyB]

Constant-coefficient Poisson equation; uniform meshes: 48³ … 768³; approximation: Q3 elements; FEM degrees of freedom: 3.05m … 12.2b; DG degrees of freedom: 7.08m … 29.0b

Computations on SuperMUC Phase 1 (SandyB, 16 cores/node) and SuperMUC Phase 2 (Haswell, 28 cores/node)

SLIDE 16

Scaling to large processor counts: Uniform meshes

◮ Continuous finite/spectral elements (FEM), CG solver preconditioned by geometric multigrid with Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 7 iterations)
◮ Interior penalty discontinuous Galerkin method (DG SIP), geometric multigrid, Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 7 iterations)

Weak scaling (constant size per proc)

[Plot: linear solver time [s] vs. number of cores (28–14,336); series: DG SIP Q4, FEM Q4]

Constant-coefficient Poisson equation; uniform meshes: 48³ … 384³; approximation: Q4 elements; FEM degrees of freedom: 7.19m … 3.64b; DG degrees of freedom: 13.8m … 7.08b

Computations on SuperMUC Phase 2 (Haswell, 28 cores/node)

SLIDE 17

Solver time versus problem size

Continuous finite elements/spectral elements (FEM), Q3 basis, CG solver preconditioned by geometric multigrid with Chebyshev smoother (degree 5, α = 15), matrix-free (∼ 6 iterations)

Computations on SuperMUC Phase 1 (SandyB, 16 cores/node)

[Plot: Poisson solver time [s] vs. problem size (10⁵–10¹² DoFs); series: SandyB with 1024, 8192, and 65,536 cores]

Computations run until memory per core exhausted

◮ Approx. 4 million DoFs in 1.5 GB (rough estimate below)
◮ GMG + CG + adaptive machinery not particularly optimized for memory usage; local size could probably be doubled
◮ Matrix-based solver for Q2 exhausts memory for ∼ 500k DoFs
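A rough back-of-the-envelope check of the first bullet (my arithmetic, not from the slides):

1.5 GB / 4 × 10⁶ DoFs ≈ 375 bytes per DoF

This includes solution and temporary vectors, the multigrid level hierarchy, and mesh/adaptivity data on top of the matrix-free operator data itself.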

SLIDE 18

Strong scaling: Adaptive vs. uniform mesh

Continuous finite elements, Poisson equation, Q2 approximation in 3D

Slowdown for adaptive case (approx. factor 5):

◮ Twice as many iterations (10 vs. 5)
◮ Load imbalance on level smoothers for adapted meshes (strategy: assign level cells according to owner of first child cell)

Constant global problem size

[Plot: linear solver time [s] vs. number of cores (56–14,336); series: FEM Q2 uniform/adaptive with 2m cells and with 17m cells]

mesh: 2.1m and 16.8m cells; adaptive mesh: 3 more levels than uniform, ∼ 10 % hanging nodes on final mesh

SLIDE 19

Weak scaling: Adaptive vs. uniform mesh

Continuous finite elements, Q2 approximation in 3D

Slowdown for adaptive case:

◮ Twice as many iterations (10 vs. 5)
◮ Load imbalance on level smoothers for adapted meshes (strategy: assign level cells according to owner of first child cell)

Constant size per proc

[Plot: linear solver time [s] vs. number of cores (28–14,336); series: FEM Q2 uniform, FEM Q2 adaptive]

degrees of freedom: 7.19m … 3.64b; adaptive mesh: 3 more levels than uniform, ∼ 15 % … 6 % hanging nodes on final mesh

SLIDE 20

Project goal: Benchmarking on engineering problems

Develop widely applicable methods with added value for realistic extreme-scale simulations in engineering

Application 1: Incompressible flow, simulation of turbulent flow, both splitting methods and monolithic solvers

Application 2: Fluid–structure interaction, monolithic solver, deforming mesh (ALE), large coarse grids (e.g. simulation of the human lung)

… and some more in interaction with the community

SLIDE 21

Summary

◮ Build on the very fast vector assembly framework in deal.II → fits linear and nonlinear operators
◮ Extend algorithms beyond those that spend most time on matrix-vector products
◮ Relatively low overhead for adaptivity and more unstructured meshes
◮ Improve data locality/cache usage of local kernels
◮ Target throughput architectures
◮ Reduce bottlenecks in large-scale (MPI) applications

SLIDE 22

Try to substantiate these questions

◮ Is higher order effective in realistic engineering settings?
◮ Is exascale possible for realistic engineering settings with complex meshes and complex physics?
◮ Can coupled/implicit solvers be realized in an exascale setting?

… add data points to ongoing discussions in application areas!
