Unified CPU+GPU Programming for the Production Weather Model ASUCA

SLIDE 1

Image: Creative Commons, NASA Goddard Space Flight Center, 2010

Michel Müller

Research Assistant, Aoki Laboratory
michel@sim.gsic.titech.ac.jp

Supervised by

Prof. Dr. Takayuki Aoki

Tokyo Institute of Technology

Unified CPU+GPU Programming for the Production Weather Model ASUCA

SLIDE 2

Unified CPU+GPU Programming for the Production Weather Model ASUCA

SLIDE 3

Unified
Single Fortran code
Performant on both CPU and GPU
Applicable to both physics and dynamics

ASUCA
dynamical core, physical processes
Japanese national weather model
In production since 2014 (PowerPC)
Meso-scale
Non-hydrostatic
Regular mesh FEM

SLIDE 4

Motivation ➤ Hybrid Fortran ➤ Results ➤ Outlook

SLIDE 5

Unified
Single Fortran code
Performant on both CPU and GPU
Applicable to both physics and dynamics

SLIDE 6

coarse-grained parallelism

Unified
Single Fortran code
Performant on both CPU and GPU
Applicable to both physics and dynamics

SLIDE 7

coarse-grained parallelism
GPU-unfriendly storage order

Unified
Single Fortran code
Performant on both CPU and GPU
Applicable to both physics and dynamics

SLIDE 8

coarse-grained parallelism
GPU-unfriendly storage order
separation device/host code

Unified
Single Fortran code
Performant on both CPU and GPU
Applicable to both physics and dynamics

SLIDE 9

coarse-grained parallelism
GPU-unfriendly storage order
data movement to/from device memory
separation device/host code

Unified
Single Fortran code
Performant on both CPU and GPU
Applicable to both physics and dynamics

SLIDE 10

coarse-grained parallelism
GPU-unfriendly storage order
data movement to/from device memory
separation device/host code
CUDA boilerplate

Unified
Single Fortran code
Performant on both CPU and GPU
Applicable to both physics and dynamics

SLIDE 11

coarse-grained parallelism

SLIDE 12

coarse-grained parallelism

SLIDE 13

?

coarse-grained parallelism

SLIDE 14

dynamical core, physical processes ... of ASUCA

SLIDE 15

dynamical core, physical processes ... of ASUCA

SLIDE 16

dynamical core, physical processes ... of ASUCA

SLIDE 17

coarse-grained parallelism

SLIDE 18

GPU-unfriendly storage order

SLIDE 19

GPU-unfriendly storage order

SLIDE 20

GPU-unfriendly storage order

SLIDE 21

GPU-unfriendly storage order
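
A minimal sketch of why storage order matters, written in Python purely for illustration. The function names `offset_kij` and `offset_ijk` and the grid sizes are assumptions and not part of Hybrid Fortran or ASUCA. It assumes Fortran-style column-major layout, where the first index is contiguous in memory. Under that assumption, KIJ order keeps a vertical column contiguous, which tends to suit a CPU thread walking one column, while IJK order places horizontally neighbouring points next to each other, which tends to suit GPU threads that each own one column and touch the same level together.

```python
# Illustrative only: linear offsets of a 3D field under two storage orders,
# assuming Fortran-style column-major layout (first index is contiguous).
NI, NJ, NK = 4, 3, 5  # hypothetical grid sizes


def offset_kij(i, j, k):
    """Offset of a(k, i, j): the vertical index k is contiguous."""
    return k + NK * (i + NI * j)


def offset_ijk(i, j, k):
    """Offset of a(i, j, k): the horizontal index i is contiguous."""
    return i + NI * (j + NJ * k)


# Distance in memory between neighbouring columns (i -> i + 1) at the same level k.
stride_kij = offset_kij(1, 0, 0) - offset_kij(0, 0, 0)
stride_ijk = offset_ijk(1, 0, 0) - offset_ijk(0, 0, 0)

print("KIJ: neighbouring columns are", stride_kij, "elements apart")
print("IJK: neighbouring columns are", stride_ijk, "element(s) apart")
```

In this sketch, threads assigned to neighbouring columns read adjacent elements only in IJK order; in KIJ order they are a full column length apart.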

SLIDE 22

SLIDE 23

coarse-grained parallelism

SLIDE 24

coarse-grained parallelism

GPU-unfriendly storage order

SLIDE 25

automate ALL THE THINGS!

SLIDE 26

?

SLIDE 27

SLIDE 28

SLIDE 29

Build System

[Figure: build system diagram.
Stages: python 1, python 2, python 3, make.
Files: h90 Fortran source + directives; XML callgraph + parsed directives; XML callgraph + parsed directives + loop analysis; F90 Fortran; executable.
Further files shown: buildtools/Makefile, MakeSettings, storage_order.F90, [project-dir]/Makefile.
Legend: hybrid file, python program, GNU Make, user defined, file with CPU + GPU version, input, output.]

SLIDE 30

Build System

[Figure: build system diagram.
Stages: python 1, python 2, python 3, make.
Files: h90 Fortran source + directives; XML callgraph + parsed directives; XML callgraph + parsed directives + loop analysis; F90 Fortran; executable.
Further files shown: buildtools/Makefile, MakeSettings, storage_order.F90, [project-dir]/Makefile.
Legend: hybrid file, python program, GNU Make, user defined, file with CPU + GPU version, input, output.
Example callgraph: calculate_all_columns, sum_column, shown for the CPU version and the GPU version.]
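
The difference between the CPU version and the GPU version of the `calculate_all_columns` / `sum_column` example can be sketched as follows. This is a Python illustration under assumptions, not Hybrid Fortran code or its generated output: the routine names come from the slide, while the data, the loop bounds and the helper `sum_column_gpu_thread` are hypothetical. The idea shown is that the CPU version parallelizes coarsely around the call to `sum_column`, whereas the GPU version moves the parallel region down so that each thread owns one column.

```python
# Illustrative only: the same column sum expressed at two granularities.
NI, NJ, NK = 3, 2, 4  # hypothetical grid sizes
field = [[[float(i + j + k) for k in range(NK)] for j in range(NJ)] for i in range(NI)]


def sum_column(column):
    """Sequential work on one vertical column."""
    total = 0.0
    for value in column:
        total += value
    return total


def calculate_all_columns_cpu(a):
    """CPU version: the parallel loop sits high in the call graph (coarse grained).
    Each iteration, standing in for one CPU thread's share, handles a whole column."""
    return [[sum_column(a[i][j]) for j in range(NJ)] for i in range(NI)]


def sum_column_gpu_thread(a, result, i, j):
    """GPU version: body executed by the one thread that owns column (i, j)."""
    result[i][j] = sum_column(a[i][j])


def calculate_all_columns_gpu(a):
    """GPU version: the caller only launches. The (i, j) loops below stand in
    for the thread grid of a kernel wrapped around the column routine."""
    result = [[0.0] * NJ for _ in range(NI)]
    for i in range(NI):
        for j in range(NJ):
            sum_column_gpu_thread(a, result, i, j)
    return result


cpu_result = calculate_all_columns_cpu(field)
gpu_result = calculate_all_columns_gpu(field)
print(cpu_result == gpu_result)
```

Both variants compute the same values; only the position of the parallel region in the call graph differs.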

SLIDE 31

Build System

[Figure: build system diagram.
Stages: python 1, python 2, python 3, make.
Files: h90 Fortran source + directives; XML callgraph + parsed directives; XML callgraph + parsed directives + loop analysis; F90 Fortran; executable.
Further files shown: buildtools/Makefile, MakeSettings, storage_order.F90, [project-dir]/Makefile.
Legend: hybrid file, python program, GNU Make, user defined, file with CPU + GPU version, input, output.]

SLIDE 32

Mid 2014

ASUCA on Hybrid Fortran

[Figure: ASUCA module diagram, each module marked as Ported or Not ported.
Modules shown: makegrid_ideal, ideal, makegrid, prep, arashi, diag long, physics long, physics diag long, Max/Min/Ave, output, rungekutta long, diag adjust long, physics adjust long, physics rk long, dynamics rk long, sediment, diagnose rk short, dynamics rk short, monitflux, radiation, convection, pbl/surface, microphys.
Loop labels: RKshort, dtshort, dtlong.]

Tests passed:
Rad on CPU, KIJ Order
Rad on CPU, IJK Order
Gabls3 on CPU, KIJ Order
Gabls3 on CPU, IJK Order
Warmbubble on CPU, KIJ Order
Rad on GPU, KIJ Order
Rad on GPU, IJK Order
Gabls3 on GPU, KIJ Order
Gabls3 on GPU, IJK Order
Warmbubble on GPU, KIJ Order

SLIDE 33

Now

ASUCA on Hybrid Fortran

[Figure: ASUCA module diagram, each module marked as Ported or Not ported.
Modules shown: makegrid_ideal, ideal, makegrid, prep, arashi, diag long, physics long, physics diag long, Max/Min/Ave, output, rungekutta long, diag adjust long, physics adjust long, physics rk long, dynamics rk long, sediment, diagnose rk short, dynamics rk short, monitflux, radiation, convection, pbl/surface, microphys.
Loop labels: RKshort, dtshort, dtlong.]

Tests passed:
Rad on CPU, KIJ Order
Rad on CPU, IJK Order
Gabls3 on CPU, KIJ Order
Gabls3 on CPU, IJK Order
Warmbubble on CPU, KIJ Order
Warmbubble on CPU, IJK Order
Rad on GPU, KIJ Order
Rad on GPU, IJK Order
Gabls3 on GPU, KIJ Order
Gabls3 on GPU, IJK Order
Warmbubble on GPU, KIJ Order
Warmbubble on GPU, IJK Order

SLIDE 34

ASUCA

Hybrid ASUCA Dynamics
OpenMP ASUCA Dynamics
OpenACC ASUCA Dynamics

Hybrid ASUCA Physics
OpenMP ASUCA Physics
CUDA Fortran ASUCA Physics

SLIDE 35

ASUCA

Hybrid ASUCA Dynamics
OpenMP ASUCA Dynamics ✓, OpenACC ASUCA Dynamics ✓ (nRMS < 1E-9)
112 Kernels, ~10k LOC, OpenACC
advection, HEVI, diagnose, Rayleigh damping

Hybrid ASUCA Physics
OpenMP ASUCA Physics ✓, CUDA Fortran ASUCA Physics ✓ (nRMS < 1E-9)
121 Kernels, ~21k LOC, CUDA Fortran
Shortwave Radiation, Longwave Radiation, Planetary Boundary Layer, surface

Performance compared to Reference Code on Westmere Xeon
~1x ~3.6x

Legend: outside of kernel(s), kernel, inside of kernel, not affected by kernel
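
A validation check against the nRMS < 1E-9 threshold could look like the following Python sketch. The slides state only the threshold, so the definition used here (RMS of the difference divided by RMS of the reference) is an assumption, as are the function name `nrms` and the sample data.

```python
import math


def nrms(result, reference):
    """Normalized RMS error: RMS of the difference divided by RMS of the reference.
    This definition is an assumption; the slides only state the threshold."""
    if len(result) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    diff_sq = sum((r - ref) ** 2 for r, ref in zip(result, reference))
    ref_sq = sum(ref ** 2 for ref in reference)
    if ref_sq == 0.0:
        raise ValueError("reference is all zeros; nRMS is undefined")
    return math.sqrt(diff_sq / ref_sq)


# Hypothetical data: a result that deviates from the reference by a tiny relative error.
reference = [1.0, 2.0, 3.0, 4.0]
result = [x * (1.0 + 1e-12) for x in reference]
passed = nrms(result, reference) < 1e-9
print("validation passed:", passed)
```

Normalizing by the reference makes the threshold independent of the physical units and magnitude of the compared field.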

SLIDE 36

Further Results

SLIDE 37

[Table: examples (rows) against features (columns). Check marks are listed per example without their column positions.]

Feature columns: # kernels; analytical validation; ref. data validation; directive privatisation; hybrid ||-isation; reduction; stencil access; halo; direct kernel call; array declared in kernel; scalar param. from device arr.; multi kernel routine; strides; array access. func.; || region in branch; early return; impl. scheme; local module data; foreign module data; pointer swap

example                    # kernels   features
getting started            3           ✓ ✓ ✓ ✓
5D vector                  1           ✓ ✓ ✓
simple stencil             1           ✓ ✓ ✓ ✓
stencil w/ local array     1           ✓ ✓ ✓ ✓ ✓
scalar passed in           1           ✓ ✓ ✓ ✓ ✓
multi kernel routines      4           ✓ ✓ ✓ ✓
strides                    2           ✓ ✓ ✓ ✓
accessor functions         1           ✓ ✓ ✓ ✓
|| branches                2           ✓ ✓ ✓ ✓
early returns              3           ✓ ✓ ✓ ✓ ✓
schemes                    4           ✓ ✓ ✓ ✓ ✓
module data                10          ✓ ✓ ✓ ✓
3D diffusion               4           ✓ ✓ ✓
particle push              1           ✓ ✓
midaco solver              1           ✓ ✓ ✓
poisson FEM solver         2           ✓ ✓ ✓ ✓

SLIDE 38

Branched codebase has partially aged for > 2 years
=> high code divergence
=> for the production version of the Hybrid code, need to basically start over

SLIDE 39

Unified
Single Fortran code
Performant on both CPU and GPU
Applicable to both physics and dynamics

coarse-grained parallelism
GPU-unfriendly storage order
data movement to/from device memory
separation device/host code
CUDA boilerplate

SLIDE 40

coarse-grained parallelism
  OpenACC: manual conversion

GPU-unfriendly storage order
  OpenACC: manual conversion

data movement to/from device memory
  OpenACC: 1 directive per data object and routine

separation device/host code
  OpenACC: kernel / host code in same routine

CUDA boilerplate
  OpenACC: reduced to single directive per kernel

SLIDE 41

coarse-grained parallelism
  OpenACC: manual conversion
  Hybrid Fortran Now: directive-based conversion

GPU-unfriendly storage order
  OpenACC: manual conversion
  Hybrid Fortran Now: directive-based conversion

data movement to/from device memory
  OpenACC: 1 directive per data object and routine
  Hybrid Fortran Now: 1 directive per data object and routine

separation device/host code
  OpenACC: kernel / host code in same routine
  Hybrid Fortran Now: kernels / host code must reside in separate routines

CUDA boilerplate
  OpenACC: reduced to single directive per kernel
  Hybrid Fortran Now: reduced to single directive per kernel

SLIDE 42

coarse-grained parallelism
  OpenACC: manual conversion
  Hybrid Fortran Now: directive-based conversion

GPU-unfriendly storage order
  OpenACC: manual conversion
  Hybrid Fortran Now: directive-based conversion

data movement to/from device memory
  OpenACC: 1 directive per data object and routine
  Hybrid Fortran Now: 1 directive per data object and routine

separation device/host code
  OpenACC: kernel / host code in same routine
  Hybrid Fortran Now: kernel / host code in same routine

CUDA boilerplate
  OpenACC: reduced to single directive per kernel
  Hybrid Fortran Now: reduced to single directive per kernel

SLIDE 43

coarse-grained parallelism
  OpenACC: manual conversion
  Hybrid Fortran Now: directive-based conversion
  Hybrid Fortran 2017: automatic conversion, centralized config

GPU-unfriendly storage order
  OpenACC: manual conversion
  Hybrid Fortran Now: directive-based conversion
  Hybrid Fortran 2017: automatic conversion, centralized config

data movement to/from device memory
  OpenACC: 1 directive per data object and routine
  Hybrid Fortran Now: 1 directive per data object and routine
  Hybrid Fortran 2017: 1 directive per data region

separation device/host code
  OpenACC: kernel / host code in same routine
  Hybrid Fortran Now: kernel / host code in same routine
  Hybrid Fortran 2017: kernel / host code in same routine

CUDA boilerplate
  OpenACC: reduced to single directive per kernel
  Hybrid Fortran Now: reduced to single directive per kernel
  Hybrid Fortran 2017: reduced to single directive per kernel

SLIDE 44

Parse all imports and specifications
Synthesis of subroutines
Feature parity between OpenACC and CUDA Fortran backend (except reductions)
Passing data between CUDA Fortran and OpenACC implemented kernels
Centralize domain configuration, analyse data flow, synthesize data region, privatize data
Recognise device code boundaries and update required data accordingly

SLIDE 45

It started as an inline converter ...

... and is becoming more of a transpiler w/ intermediate representation

SLIDE 46

Questions?

michel@sim.gsic.titech.ac.jp

Btw. Hybrid Fortran is free and Open Source (LGPL License)
