Building a Better Astrophysics AMR Code with Charm++: Enzo-P/Cello


SLIDE 1

Building a Better Astrophysics AMR Code with Charm++: Enzo-P/Cello

(or more adventures in parallel computing)

  • Prof. Michael L. Norman, Director, San Diego Supercomputer Center, University of California, San Diego
  • Charm++ Workshop, April 17, 2017

Supported by NSF grants SI2-SSE-1440709, PHY-1104819, and AST-0808184.

SLIDE 2

I am a serial code developer…

  • I do it because I like it
  • I do it to learn new physics, so I can tackle new problems
  • I do it to learn new HPC computing methods because they are interesting
  • Developing with Charm++ is my latest experiment

SLIDE 3

My intrepid partner in this journey

  • James Bordner
  • PhD CS UIUC, 1999
  • C++ programmer extraordinaire
  • Enzo-P/Cello is entirely his design and implementation

SLIDE 4

My first foray into numerical cosmology on NCSA CM5 (1992-1994)


[Figure: Thinking Machines CM5. Large-scale structure on a 512³ grid; KRONOS run on 512 processors in Connection Machine Fortran.]

SLIDE 5

Enzo: Numerical Cosmology on an Adaptive Mesh (Bryan & Norman 1997, 1999)

  • Adaptive in space and time
  • Arbitrary number of refinement levels
  • Arbitrary number of refinement patches
  • Flexible, physics-based refinement criteria
  • Advanced solvers

SLIDE 6

Enzo in action

Berger & Colella (1989) structured AMR

[Figure panels: gas density; refinement level]

SLIDE 7

Application: Radiation Hydrodynamic Cosmological Simulations of the First Galaxies


[Figure: NCSA Blue Waters]

SLIDE 8

Enzo: AMR Hydrodynamic Cosmology Code

http://enzo-project.org

  • Enzo code under continuous development since 1994
    – First hydrodynamic cosmological AMR code
    – Hundreds of users
  • Rich set of physics solvers (hydro, N-body, radiation transport, chemistry, …)
  • Have done simulations with 10¹² dynamic range and 42 levels

[Figure: First Stars, First Galaxies, Reionization]

SLIDE 9

Enzo’s Path

  • 1994: NCSA SGI Power Challenge Array (shared-memory multiprocessor)
  • 2013: NCSA Cray XE6 Blue Waters (distributed-memory multicore)
  • Lots of computers in between

SLIDE 10

Birth of a Galaxy Animation

From First Stars to First Galaxies

SLIDE 11

Extreme Scale Numerical Cosmology

  • Dark-matter-only N-body simulations have crossed the 10¹² particle threshold on the world's largest supercomputers
  • Hydrodynamic cosmology applications are lagging behind N-body simulations
  • This is due to the lack of extreme-scale AMR frameworks

[Figure: 1 trillion particle dark matter simulation on IBM BG/Q, Habib et al. (2013)]

SLIDE 12

Enzo’s Scaling Limitations

[Figure: refinement level]


  • Scaling limitations are due to AMR data structures
  • Root grid is block decomposed, each block an MPI task
  • Blocks are much larger than subgrid blocks owned by tasks
  • Structure formation leads to task load imbalance
  • Moving subgrids to other tasks to load balance breaks data locality due to parent-child communication

[Figure labels: each block an MPI task; OpenMP threads over subgrids]
SLIDE 13

[Figure: "W cycle" of hierarchical timestepping: Δt; Δt/2, Δt/2; Δt/4 x 4]

Serialization over level updates also limits scalability and performance.

SLIDE 14

[Figure: the same "W cycle" hierarchy drawn at relative scale: Δt; Δt/2, Δt/2; Δt/4 x 4]

Deep hierarchical timestepping is needed to reduce cost.
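To make the pattern concrete, here is a minimal sketch of how a "W cycle" recurses over refinement levels, each finer level taking two half-size substeps per coarse step (illustrative C++; update_grids and sync_with_parent are hypothetical stand-ins, not Enzo's actual routines):

    // Hypothetical per-level hooks (not Enzo's API).
    void update_grids(int level, double dt);   // advance all grids on one level by dt
    void sync_with_parent(int level);          // flux correction / restriction to parent

    // Advance one level by dt, then recursively advance the finer levels
    // in two dt/2 substeps each, producing the "W" pattern shown above.
    // Note the serialization: level L+1 cannot start until level L is done.
    void advance_level(int level, double dt, int max_level) {
        update_grids(level, dt);
        if (level < max_level) {
            advance_level(level + 1, dt / 2, max_level);
            advance_level(level + 1, dt / 2, max_level);
        }
        if (level > 0) sync_with_parent(level);
    }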

SLIDE 15

Adopted Strategy

  • Keep the best part of Enzo (numerical solvers) and replace the AMR infrastructure
  • Implement using modern OOP best practices for modularity and extensibility
  • Use the best available scalable AMR algorithm
  • Move from a bulk-synchronous to a data-driven, asynchronous execution model to support patch-adaptive timestepping
  • Leverage parallel runtimes that support this execution model and have a path to exascale
  • Make the AMR software library application-independent so others can use it

SLIDE 16

Software Architecture


[Figure: software stack, top to bottom: numerical solvers; scalable data structures & functions; parallel execution & services (DLB, FT, I/O, etc.); hardware (heterogeneous, hierarchical)]

SLIDE 17

Software Architecture


[Figure: the same stack instantiated: Enzo numerical solvers; forest-of-octrees AMR; Charm++; hardware (heterogeneous, hierarchical)]

SLIDE 18

Software Architecture


[Figure: the stack by name: Enzo-P; Cello; Charm++; Charm++-supported platforms]

SLIDE 19

Forest (=Array) of Octrees

Burstedde, Wilcox & Ghattas (2011)


[Figure: 2 x 2 x 2 and 6 x 2 x 2 forests of trees; unrefined vs. refined tree]

SLIDE 20

p4est weak scaling: mantle convection

Burstedde et al. (2010), Gordon Bell prize finalist paper

SLIDE 21

What makes it so scalable?

Fully distributed data structure; no parent-child communication


Burstedde, Wilcox & Ghattas (2011)

SLIDE 22

Charm++

SLIDE 23

(Laxmikant Kale et al. PPL/UIUC)

SLIDE 24

SLIDE 25

Charm++ powers NAMD

SLIDE 26
  • Goal: implement Enzo's rich set of physics solvers on a new, extremely scalable AMR software framework (Cello)
  • Cello implements forest-of-quad/octree AMR on top of the Charm++ parallel objects system
  • Cello is designed to be application- and architecture-agnostic (OOP)
  • Cello is available NOW at http://cello-project.org

SLIDE 27

[Figure: layered code diagram; "fields & particles" data structures; sequential and parallel components]
SLIDE 28

Demonstration of Enzo-P/Cello


[Figure: total energy]

SLIDE 29

Demonstration of Enzo-P/Cello


[Figure: mesh refinement level]

SLIDE 30

Demonstration of Enzo-P/Cello


[Figure: tracer particles]

SLIDE 31

SLIDE 32

Dynamic Load Balancing

Charm++ implements dozens of user-selectable methods
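As a concrete note (standard Charm++ usage, not specific to Cello): a balancer is typically chosen at run time with the +balancer command-line option, e.g. +balancer GreedyLB or +balancer RefineLB, with no change to the application source.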

SLIDE 33

How does Cello implement FOT?

  • A forest is an array of octrees of arbitrary size K x L x M
  • An octree has leaf nodes, which are blocks (N x N x N cells)
  • Each block is a chare (unit of sequential work)
  • The entire FOT is stored as a chare array using a bit-index scheme (sketched below)
  • Chare arrays are fully distributed data structures in Charm++


[Figure: a 2 x 2 x 2 tree; each leaf is an N x N x N block]
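To illustrate the idea of a bit-index scheme for addressing blocks in a forest of octrees, here is a hedged sketch (illustrative C++; the struct layout and names are assumptions, not Cello's actual encoding):

    #include <cstdint>

    // A block's address: which tree in the K x L x M forest, how deep it
    // sits, and the path of octant choices taken to reach it.
    struct BlockIndex {
        uint32_t tree_x, tree_y, tree_z;  // octree coordinates in the forest
        int      level;                   // refinement level within the tree
        uint64_t path;                    // 3 bits per level: octant at each descent
    };

    // Address of one child: descend one level, appending its 3-bit octant.
    BlockIndex child_of(const BlockIndex& b, int octant /* 0..7 */) {
        BlockIndex c = b;
        c.level = b.level + 1;
        c.path  = (b.path << 3) | static_cast<uint64_t>(octant & 7);
        return c;
    }

    // Address of the parent: ascend one level, dropping the last 3 bits.
    BlockIndex parent_of(const BlockIndex& b) {
        BlockIndex p = b;
        p.level = b.level - 1;
        p.path  = b.path >> 3;
        return p;
    }

An index like this can serve as the key of a fully distributed chare array: any block can name its neighbors, children, or parent arithmetically, without consulting a central directory.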

SLIDE 34


  • Each leaf node of the tree is a block
  • Each block is a chare
  • The forest of trees is represented as a chare array
SLIDE 35

SLIDE 36

SLIDE 37

SLIDE 38

SLIDE 39

SLIDE 40

SLIDE 41

Particles in Cello

SLIDE 42

SLIDE 43

SLIDE 44

SLIDE 45

WEAK SCALING TEST – HOW BIG AN AMR MESH CAN WE DO?

SLIDE 46

Unit cell: 1 tree per core; 201 blocks/tree, 32³ cells/block
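A quick arithmetic check of the unit cell: 201 blocks x 32³ cells = 201 x 32,768 ≈ 6.6 million cells per tree, so the 64³ = 262,144-tree run in the table below reaches 262,144 x 6.6 M ≈ 1.7 trillion cells.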

SLIDE 47

Weak scaling test: "Alphabet Soup" array of supersonic blast waves

    N³ trees   Np = cores   Blocks (chares)   Cells
    1³         1            201               6.6 M
    2³         8            1,608
    3³         27           5,427
    4³         64           12,864
    5³         125
    6³         216
    8³         512
    10³        1,000        201,000
    16³        4,096
    24³        13,824
    32³        32,768
    40³        64,000       12.9 M
    48³        110,592      22.2 M            0.7 T
    54³        157,464      31.6 M            1.0 T
    64³        262,144      52.7 M            1.7 T

SLIDE 48

Largest AMR simulation in the world: 1.7 trillion cells on 262K cores on NCSA Blue Waters

SLIDE 49

Charm++ messaging bottleneck

SLIDE 50

[Figure: profile of Cello functions vs. the Enzo-P solver]

SLIDE 51

SLIDE 52

SCALING IN THE HUMAN DIMENSION – SEPARATION OF CONCERNS

SLIDE 53

Cello

[Figure: Cello layers: high-level; data structures (middle level); hardware interface]

SLIDE 54

Object-oriented design in C++ implements "separation of concerns", enhancing extensibility, maintainability, and understandability.

SLIDE 55

Adding a Method to Enzo-P is Easy

(As easy as writing a sequential program)
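"As easy as writing a sequential program" is meant literally: the developer writes only serial per-block kernels, while Cello handles ghost-zone exchange, refinement, and parallel execution. A minimal, self-contained sketch of the kind of kernel one would supply (illustrative C++, not Enzo-P's actual code):

    #include <vector>

    // Explicit heat-conduction update on one N x N block (2D for brevity).
    // Interior cells only; boundary/ghost values are assumed to have been
    // filled by the framework. Stable for alpha*dt/dx^2 <= 1/4.
    void heat_step(std::vector<double>& T, int N,
                   double alpha, double dt, double dx) {
        std::vector<double> Tn(T);
        const double c = alpha * dt / (dx * dx);
        for (int j = 1; j < N - 1; ++j) {
            for (int i = 1; i < N - 1; ++i) {
                const int k = j * N + i;
                Tn[k] = T[k] + c * (T[k-1] + T[k+1] + T[k-N] + T[k+N] - 4.0 * T[k]);
            }
        }
        T.swap(Tn);
    }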

SLIDE 56

Voila’, parallel AMR heat conduction

SLIDE 57

Current Work: Linear Solvers

  • Poisson and implicit flux-limited diffusion equations
  • CG and BiCGStab implemented and functioning in parallel (textbook CG sketched below)
    – Suffer from poor algorithmic scaling
  • HG algorithm (D. Reynolds) under development (multigrid-preconditioned BiCGStab)
    – Matlab prototype exhibits excellent algorithmic and parallel scalability
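Since the talk leans on CG as a building block, here is the textbook algorithm for reference (a generic sketch, not Cello's distributed implementation; the matrix is supplied as a matvec callback):

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // Solve A x = b for symmetric positive-definite A, where
    // A(in, out) computes out = A * in.
    void cg(const std::function<void(const Vec&, Vec&)>& A,
            const Vec& b, Vec& x, int max_iter, double tol) {
        Vec r(b.size()), p(b.size()), Ap(b.size());
        A(x, Ap);
        for (std::size_t i = 0; i < b.size(); ++i) r[i] = b[i] - Ap[i];
        p = r;
        double rr = dot(r, r);
        for (int it = 0; it < max_iter && std::sqrt(rr) > tol; ++it) {
            A(p, Ap);                                  // one matvec per iteration
            const double alpha = rr / dot(p, Ap);
            for (std::size_t i = 0; i < x.size(); ++i) {
                x[i] += alpha * p[i];                  // update solution
                r[i] -= alpha * Ap[i];                 // update residual
            }
            const double rr_new = dot(r, r);
            const double beta = rr_new / rr;
            for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
    }

The iteration count of unpreconditioned CG grows with the condition number of the matrix, which is the "poor algorithmic scaling" noted above; multigrid preconditioning (as in the HG approach) keeps the count roughly constant as the mesh grows.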

SLIDE 58

Takeaways

  • Cello is a software framework for extreme-scale AMR simulations
  • Cello implements the most scalable AMR algorithm known: forest-of-octrees
  • Parallelism is handled by Charm++, which supports fully distributed AMR data structures, asynchronous execution, dynamic load balancing, fault tolerance, and parallel I/O
  • Developing applications on top of Cello is easy: as simple as writing a sequential program
  • It is available NOW at http://cello-project.org

SLIDE 59

Path Forward

  • Finish scalable gravity solver (we're close!)
  • Do a 1 trillion cell/particle hydro cosmology simulation as a demonstration
  • Implement block-adaptive timestepping
    – Exercises Charm++'s dynamic execution capability
  • Experiment with Charm++'s built-in DLB schemes on real applications

SLIDE 60

Resources

  • Project site: http://www.cello-project.org
  • Source code: https://bitbucket.org/cello-project
  • Tutorials: on project site

SLIDE 61
