SLIDE 1

(Toward) Radiative transfer on AMR with GPUs

Dominique Aubert Université de Strasbourg Austin, TX, 14.12.12

Thursday 13 December 2012

SLIDE 2

A few words about GPUs

  • Cache and control replaced by calculation units
  • Large number of Multiprocessors + Scheduler
  • High load + Independent + Non-Random Memory access
  • Speed-ups of x10 to x100 compared to CPU
  • High-level interface with CUDA (C, Nvidia), OpenCL (Khronos)
  • High-end GPUs ~1.5-2 kEuros, 4-6 GB RAM
  • Tianhe (Changsha, 7168 GPUs), Titan (Oak Ridge, 7000-18 000 GPUs by 2012), MareNostrum (???), in France: Titane (198 GPUs), Curie (268 GPUs)

SLIDE 3

Principle of GPU programming with CUDA

[Diagram: data transfer Host RAM → GPU RAM → Shared Memory (blocks), where the calculations take place, then data transfer back GPU RAM → Host RAM]

If possible: independent & identical threads + High arithmetic intensity = acceleration
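
The "independent & identical threads" idea can be sketched on the CPU. The block below is a toy emulation, not the CUDA API: `launch`, `saxpy_kernel` and the `dev_*` lists are hypothetical names, and the two loops run serially where a GPU would run them in parallel.

```python
# Toy CPU emulation of a CUDA-style launch (hypothetical names, not the CUDA API).

def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Body executed by one 'thread': identical code, independent data."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(x):                          # guard for the last, partial block
        out[i] = a * x[i] + y[i]


def launch(kernel, n, block_dim, *args):
    """Run the kernel once per (block, thread) pair, serially on the CPU."""
    n_blocks = (n + block_dim - 1) // block_dim  # ceiling division
    for block_idx in range(n_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


host_x = [float(i) for i in range(10)]
host_y = [1.0] * 10
# On a real GPU these would be copies from host RAM to GPU RAM.
dev_x, dev_y, dev_out = list(host_x), list(host_y), [0.0] * 10
launch(saxpy_kernel, len(dev_x), 4, 2.0, dev_x, dev_y, dev_out)
host_out = list(dev_out)  # stands in for the copy back to host RAM
```

No thread reads what another thread writes, which is what makes the work independent. This particular kernel does very little arithmetic per value loaded, so it only illustrates the indexing, not the "high arithmetic intensity" condition.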

SLIDE 4
1. Cosmological Radiative Transfer

SLIDE 5

Radiative Transfer equations : explicit solver

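$$\frac{\partial U}{\partial t} + \frac{\partial F(U)}{\partial x} = S \qquad\Rightarrow\qquad \frac{U^{p+1} - U^{p}}{\Delta t} + \frac{\partial F(U^{p})}{\partial x} = S$$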
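
A minimal sketch of such an explicit update, assuming a 1D periodic grid and the simplest possible flux, F = cU with c > 0 and an upwind difference. This is a stand-in for illustration only, not the M1 scheme used in ATON; `explicit_step` is a hypothetical name.

```python
def explicit_step(u, c, dx, dt, source=None):
    """One explicit step of dU/dt + dF/dx = S with F = c*U, c > 0 (upwind)."""
    n = len(u)
    if source is None:
        source = [0.0] * n
    flux = [c * ui for ui in u]  # F(U^p), evaluated at the old time level p
    new_u = []
    for i in range(n):
        # Upwind difference; index i-1 wraps around (periodic domain).
        div_f = (flux[i] - flux[i - 1]) / dx
        new_u.append(u[i] - dt * div_f + dt * source[i])
    return new_u


u0 = [0.0, 1.0, 0.0, 0.0, 0.0]
u1 = explicit_step(u0, c=1.0, dx=1.0, dt=1.0)  # Courant number c*dt/dx = 1
u_half = explicit_step(u0, c=1.0, dx=1.0, dt=0.5)
```

Each new value depends only on old values, so every cell can be updated independently of the others.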

Explicit: CFL constraint

100 000 time steps are required to cover the reionization (z~5)

$$c < \frac{\Delta x}{\Delta t} \quad\Longleftrightarrow\quad \Delta t < \frac{\Delta x}{c}$$
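
As a worked example of the constraint, the sketch below converts a cell size into a maximum time step and counts the steps needed to cover a given duration. The unit constants and the function names are assumptions made for illustration; the 1 kpc cell is an arbitrary example value, while the 10 000 yr step and 1 Gyr duration are the figures quoted later in the talk.

```python
import math

KM_PER_KPC = 3.0857e16        # kilometres in one kiloparsec
SECONDS_PER_YEAR = 3.15576e7  # Julian year
C_KM_S = 299792.458           # speed of light in km/s


def cfl_timestep_yr(dx_kpc, c_km_s=C_KM_S, courant=1.0):
    """Largest explicit time step, dt = courant * dx / c, in years."""
    dt_seconds = courant * dx_kpc * KM_PER_KPC / c_km_s
    return dt_seconds / SECONDS_PER_YEAR


def n_steps(duration_yr, dt_yr):
    """Number of time steps of length dt_yr needed to cover duration_yr."""
    return math.ceil(duration_yr / dt_yr)


dt_1kpc = cfl_timestep_yr(1.0)     # light-crossing time of a 1 kpc cell
steps_1gyr = n_steps(1.0e9, 1.0e4)  # 1 Gyr covered with dt = 10 000 yr
```

A cell of 1 kpc gives a step of a few thousand years, and covering 1 Gyr with 10 000 yr steps takes 100 000 steps, the order of magnitude quoted on this slide.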

First 2 moments of the RT equations + variable Eddington Tensor with M1 closure relation

Gonzales et al. 2008, Aubert & Teyssier 2008, Rosdahl & Blaizot 2012

With GPUs, this is affordable at the true speed of light, c = 300 000 km/s

Aubert & Teyssier 2008, 2010

SLIDE 6

Post-Processed Radiative Transfer with ATON

  • Inputs: gas density + sources
  • UV + X-ray radiation transport: conservative transport, fixed & predictable number of operations; independent and contiguous calculations
  • H chemistry + heating: subcycled physics, (almost) fixed number of operations; independent and high load
  • Outputs: radiative energy, ionisation state, temperature
  • Regular 3D grid
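
A minimal sketch of what subcycling the local physics inside one transport step can look like. The decay equation dx/dt = -rate * x is a toy stand-in for the chemistry and heating, and `subcycled_decay` and `max_change` are hypothetical names, not taken from ATON.

```python
import math


def subcycled_decay(x, rate, dt, max_change=0.1):
    """Integrate dx/dt = -rate * x over dt, in sub-steps with rate * dt_sub <= max_change.

    Returns the new value and the number of sub-steps taken.
    """
    n_sub = max(1, math.ceil(rate * dt / max_change))
    dt_sub = dt / n_sub
    for _ in range(n_sub):
        x = x / (1.0 + rate * dt_sub)  # implicit Euler: stays positive
    return x, n_sub


slow_value, slow_steps = subcycled_decay(1.0, rate=0.5, dt=1.0, max_change=0.5)
fast_value, fast_steps = subcycled_decay(1.0, rate=2.0, dt=1.0, max_change=0.5)
```

Each cell is integrated on its own, with a sub-step count that depends on its local rate. In this toy model that is one way the operation count can be "(almost)" rather than strictly fixed.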

SLIDE 7

Performance, GPUs vs CPUs: x80

[Figure: CPU (Opteron 2.7 GHz) vs GPU (8800 GTX) performance for 64^3, 128^3, 192^3 and 256^3 grids]

Aubert & Teyssier 2008, 2010

SLIDE 8

Multi-GPU with boundary layers

Aubert & Teyssier 2008, 2010
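
A minimal sketch of a boundary-layer (ghost-cell) exchange, assuming a 1D periodic field split into equal sub-domains. The function names are hypothetical; in a real multi-GPU run the copies below would go through inter-process communication rather than plain list assignments.

```python
def split_with_ghosts(field, n_domains):
    """Split a 1D field into equal chunks, each padded with one ghost cell per side."""
    size = len(field) // n_domains
    return [[0.0] + field[d * size:(d + 1) * size] + [0.0]
            for d in range(n_domains)]


def exchange_boundaries(domains):
    """Fill each domain's ghost cells from its neighbours' edge cells (periodic)."""
    n = len(domains)
    for d, dom in enumerate(domains):
        left, right = domains[(d - 1) % n], domains[(d + 1) % n]
        dom[0] = left[-2]    # last interior cell of the left neighbour
        dom[-1] = right[1]   # first interior cell of the right neighbour


domains = split_with_ghosts([float(i) for i in range(8)], n_domains=2)
exchange_boundaries(domains)
```

After the exchange each sub-domain can apply its stencil to all interior cells without touching another sub-domain's memory.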

SLIDE 9

Applications: TRASH Project (Transfert RAdiatif Sur Hydrodynamique, i.e. radiative transfer on hydrodynamics)

  • Gas and source distribution from the Mare Nostrum hydro simulation: 1024x1024x1024 cells + 2 refinement levels
  • Self-consistent stellar particles used as sources
  • cudATON on TITANE-CCRT: 1024^3 grid
  • Cartesian domain decomposition 8x8x2 (128 GPUs, S1070 servers, Infiniband DDR)
  • ~60 000 - 180 000 time steps, dt ~10 000 yr over 1 Gyr
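
The decomposition arithmetic can be checked in a few lines. `subdomain_shape` is a hypothetical helper, and the assignment of the 8, 8 and 2 factors to particular axes is an assumption.

```python
def subdomain_shape(grid, procs):
    """Cells per process along each axis for a Cartesian decomposition."""
    for n, p in zip(grid, procs):
        if n % p != 0:
            raise ValueError("grid size must be divisible by the process count")
    return tuple(n // p for n, p in zip(grid, procs))


shape = subdomain_shape((1024, 1024, 1024), (8, 8, 2))
n_gpus = 8 * 8 * 2
cells_per_gpu = shape[0] * shape[1] * shape[2]
```

With these numbers each of the 128 GPUs holds a 128x128x512 block, about 8.4 million cells, before any boundary layers are added.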

SLIDE 10

Structure of the UV background at different resolutions and with different sub-grid models

Aubert & Teyssier, ApJ, 2010

SLIDE 11

Timings on Titane

Communication: ~10-15% of the total time

[Figure: timings for 512^3 on 64 GPUs, 512^3 on 8 GPUs, 1024^3 on 64 GPUs and 1024^3 on 128 GPUs]

Aubert & Teyssier, ApJ, 2010

SLIDE 12

Small scale effects

[Figure panels: with subgrid clumping | without subgrid clumping]

100 Mpc/h, 1024^3 box; clumping C(delta) extracted from a 12.5 Mpc/h, 1024^3 box

Aubert & Teyssier, ApJ, 2010

SLIDE 13

[Figure panels: J21 vs nH | x vs nH]

Aubert & Teyssier, ApJ, 2010

SLIDE 14

Residual Neutral Fraction and J21

~100 runs at 1024^3 resolution

Aubert & Teyssier, ApJ, 2010

SLIDE 15

Residual Neutral Fraction and J21

100 Mpc/h, 1024^3

Aubert & Teyssier, ApJ, 2010

SLIDE 16

Application: Local Group Reionisation (with P. Ocvirk)

CLUES zoom on the Local Group
Timing of the local reionisation?
Ocvirk et al. 2012a,b (submitted + in prep.)

SLIDE 17

Application: Merger Trees of HII regions during overlap (with J. Chardin)

Chardin, Aubert & Ocvirk, A&A, 2012

SLIDE 18

Grand Challenge Curie-CCRT: 256 GPUs, 2048x2048x2048 grid, 60 000 time steps, 15 h

Curie, CCRT-CEA

Large Volumes for 21cm forecast (with B. Semelin)

ionized fraction at z~10

SLIDE 19

RAMSES-RT (with T. Stranex & R. Teyssier, Zurich)

UNIGRID version will be used on Titan for the INCITE project

RAMSES & ATON are coupled

RAMSES (DYNAMICS) ATON (RT)

courtesy T. Stranex

SLIDE 20
3. Towards Multi-Fluid AMR

SLIDE 21

  • N-Body: AMR + GPU + Multi-GPU OK, ~x10 w.r.t. CPU (Aubert et al. 2009)
  • Hydro: AMR + GPU OK, ~x15 w.r.t. CPU
  • Radiative Transfer: GPU + Multi-GPU OK, x30-40 (?) w.r.t. CPU (Aubert & Teyssier 2008, 2010)

EMMA Project: 3 fluids coupled on an AMR structure with hardware acceleration, e.g. with GPUs

SLIDE 22

Multi-GPU PM

  • 1.2 billion particles (1024^3 real particles + 2x10^8 ghosts)
  • 8 s per time step on 64 Teslas, with 25% spent in communications
  • With sort optimisation we may expect 6 s per time step (communication ~40%)
  • Asynchronous communications?

SLIDE 23

Under Heavy development

SLIDE 24

Quartz

  • Written in C+CUDA+MPI
  • Parallel (Space-Filling Curve + essential Tree domain decomposition)
  • AMR, with FTT data structure
  • N-Body + Hydro only (for the moment)
  • MG Poisson Solver on GPU + MUSCL-Hancock Godunov Hydro Solver on GPU + Data Logistics on CPU
  • Hopefully will become EMMA (ElectroMagnetism and Mechanics on AMR) for gravity + hydro + radiation

SLIDE 25

Fully Threaded Tree (Khokhlov 1997) (aka «Pointer Party»)

ART (Kravtsov et al. 1997) RAMSES (Teyssier 2001)

[Figure: tree cells linked by pointers (arrows), with particles attached]

SLIDE 26

Fully THREADED Tree

In a lot of cases, the tree is explored horizontally, level by level (with some +/-1 level interactions at boundaries). Even CIC can be considered level by level.
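
A minimal sketch of a horizontal, level-by-level view of a tree. The `Cell` class is a deliberately simplified binary tree, not the Fully Threaded Tree layout of Khokhlov, and all names are hypothetical.

```python
class Cell:
    """Minimal tree cell: a level, a value and (optionally) refined children."""

    def __init__(self, level, value):
        self.level = level
        self.value = value
        self.children = []

    def refine(self, n_children=2):
        """Create children one level deeper, inheriting the parent value."""
        self.children = [Cell(self.level + 1, self.value)
                         for _ in range(n_children)]
        return self.children


def cells_by_level(root):
    """Return {level: [cells]} so that each level can be swept on its own."""
    levels = {}
    stack = [root]
    while stack:
        cell = stack.pop()
        levels.setdefault(cell.level, []).append(cell)
        stack.extend(cell.children)
    return levels


root = Cell(0, 1.0)
left, right = root.refine()
left.refine()
levels = cells_by_level(root)
counts = {lvl: len(cells) for lvl, cells in sorted(levels.items())}
```

Once the cells are grouped this way, an operation such as a relaxation sweep can loop over one level at a time and treat the +/-1 level couplings as boundary conditions.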

SLIDE 27

[Figure panels: multi-level grid | multi-level CIC density | potential (via relaxation)]

SLIDE 28

«Vectorization»

[Diagram: AMR tree (CPU) → gather → flat vector (GPU) → scatter → AMR tree (CPU)]

This leads to a bottleneck. Patch-based AMR may be more appropriate (see e.g. Schive et al. 2009)
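
The gather / compute / scatter cycle can be sketched as follows, assuming the tree is represented by a plain list of cells with a `level` field. All names are hypothetical, and `kernel` stands in for whatever would run on the GPU.

```python
def gather(cells, level):
    """Collect the values of one level into a contiguous (flat) list."""
    idx = [i for i, c in enumerate(cells) if c["level"] == level]
    flat = [cells[i]["value"] for i in idx]
    return idx, flat


def kernel(flat):
    """Stand-in for the GPU computation on the flat vector."""
    return [2.0 * v for v in flat]


def scatter(cells, idx, flat):
    """Write the results back to their original cells."""
    for i, v in zip(idx, flat):
        cells[i]["value"] = v


cells = [
    {"level": 0, "value": 1.0},
    {"level": 1, "value": 2.0},
    {"level": 1, "value": 3.0},
    {"level": 2, "value": 4.0},
]
idx, flat = gather(cells, level=1)
scatter(cells, idx, kernel(flat))
values = [c["value"] for c in cells]
```

The gather and scatter loops touch every selected cell on the host side for each kernel call, which is where the bottleneck mentioned on this slide comes from.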

SLIDE 29

How do we vectorize?

[Figure: a 12-cell example comparing the two storage layouts]

  • Storing neighbor values: coalescent, but requires a large gather
  • Storing neighbor addresses: ~non-coalescent, but no gather
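
The two options can be contrasted on a toy 1D periodic stencil that sums each cell's left and right neighbors. The values and the function names are illustrative assumptions, not the numbers from the slide's figure.

```python
def sum_with_neighbor_values(values, left_idx, right_idx):
    """Option 1: gather neighbor values first, then read contiguous arrays."""
    left_vals = [values[j] for j in left_idx]    # large gather, done up front
    right_vals = [values[j] for j in right_idx]
    return [l + r for l, r in zip(left_vals, right_vals)]


def sum_with_neighbor_addresses(values, left_idx, right_idx):
    """Option 2: keep only the indices and read the values indirectly."""
    return [values[left_idx[i]] + values[right_idx[i]]
            for i in range(len(values))]


values = [1.0, 2.0, 4.0, 8.0]
left_idx = [3, 0, 1, 2]    # index of each cell's left neighbor (periodic)
right_idx = [1, 2, 3, 0]   # index of each cell's right neighbor (periodic)
sums_a = sum_with_neighbor_values(values, left_idx, right_idx)
sums_b = sum_with_neighbor_addresses(values, left_idx, right_idx)
```

Both give the same result. The difference is where the indirect reads happen: once during the gather in option 1, or inside the compute loop in option 2.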

SLIDE 30

AMR issues with Explicit formulation

[Diagram: hydro and radiation time steps on Level L (coarse) and Level L+1 (fine)]

Subcycling induces problematic inter-level interactions.

It forces the hydro to be synchronized with the radiation. E.g. Rosdahl & Blaizot reduce the speed of light by a factor of 10-100 and synchronize the hydro on a small radiation time step.
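
A rough estimate of the time step mismatch, assuming both solvers obey dt = dx / speed on the same cell. The hydro signal speed of 100 km/s is a made-up illustrative value, and `radiation_substeps` is a hypothetical helper.

```python
import math


def radiation_substeps(c_km_s, v_hydro_km_s, reduction=1.0):
    """Radiation steps per hydro step when both use dt = dx / speed."""
    return math.ceil((c_km_s / reduction) / v_hydro_km_s)


C_KM_S = 300000.0   # speed of light, as rounded on the slides
V_HYDRO = 100.0     # hypothetical hydro signal speed in km/s

full_c = radiation_substeps(C_KM_S, V_HYDRO)
reduced_10 = radiation_substeps(C_KM_S, V_HYDRO, reduction=10.0)
reduced_100 = radiation_substeps(C_KM_S, V_HYDRO, reduction=100.0)
```

Under these assumptions, reducing the speed of light by 10 to 100 brings the ratio from thousands of radiation steps per hydro step down to a few hundred or a few tens, which makes stepping the hydro on the radiation time step far less costly.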

SLIDE 31

Current Status

Without optimizations: ~x10-15 (double precision) compared to CPU for hydro. RT might kill this speed-up or increase it...
