Unified CPU+GPU Programming for the Production Weather Model ASUCA



SLIDE 1

Creative Commons: NASA Goddard Space Flight Centre, 2010

Michel Müller

Research Assistant, Aoki Laboratory michel@sim.gsic.titech.ac.jp

Supervised by

  • Prof. Dr. Takayuki Aoki

Tokyo Institute of Technology

Unified CPU+GPU Programming for the Production Weather Model ASUCA

SLIDE 2

Unified CPU+GPU Programming for the Production Weather Model ASUCA

SLIDE 3

Unified

  • Single Fortran code
  • Performant on both CPU and GPU
  • Applicable to both physics and dynamics

dynamical core, physical processes

ASUCA

  • National Japanese weather model
  • In production since 2014 (PowerPC)
  • Meso-scale
  • Non-hydrostatic
  • Regular mesh FEM

SLIDE 4

Motivation ➤ Hybrid Fortran ➤ Results ➤ Outlook

SLIDE 5

Unified

  • Single Fortran code
  • Performant on both CPU and GPU
  • Applicable to both physics and dynamics

SLIDE 6

Unified

  • Single Fortran code
  • Performant on both CPU and GPU
  • Applicable to both physics and dynamics

  • coarse grained parallelism

SLIDE 7

Unified

  • Single Fortran code
  • Performant on both CPU and GPU
  • Applicable to both physics and dynamics

  • coarse grained parallelism
  • GPU unfriendly storage order

SLIDE 8

Unified

  • Single Fortran code
  • Performant on both CPU and GPU
  • Applicable to both physics and dynamics

  • coarse grained parallelism
  • GPU unfriendly storage order
  • separation device/host code

SLIDE 9

Unified

  • Single Fortran code
  • Performant on both CPU and GPU
  • Applicable to both physics and dynamics

  • coarse grained parallelism
  • GPU unfriendly storage order
  • data movement to/from device memory
  • separation device/host code

SLIDE 10

Unified

  • Single Fortran code
  • Performant on both CPU and GPU
  • Applicable to both physics and dynamics

  • coarse grained parallelism
  • GPU unfriendly storage order
  • data movement to/from device memory
  • separation device/host code
  • CUDA boilerplate

SLIDE 11

coarse grained parallelism

SLIDE 12

coarse grained parallelism

SLIDE 13

?

coarse grained parallelism

SLIDE 14

dynamical core, physical processes ... of ASUCA

SLIDE 15

dynamical core, physical processes ... of ASUCA

SLIDE 16

dynamical core, physical processes ... of ASUCA

SLIDE 17

coarse grained parallelism

SLIDE 18

GPU unfriendly storage order

SLIDE 19

GPU unfriendly storage order

SLIDE 20

GPU unfriendly storage order

SLIDE 21

GPU unfriendly storage order
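Note: the storage-order problem can be made concrete with a small index calculation. The sketch below is an illustration, not code from the slides. It assumes that "KIJ" and "IJK" (the two orders named in the test lists later in the deck) give the fastest-varying index first, as in Fortran's column-major layout, and that GPU threads are mapped to the horizontal index i. The function names and the grid size are hypothetical.

```python
def linear_offset(i, j, k, ni, nj, nk, order):
    """Return the 0-based memory offset of element (i, j, k).

    The first letter of `order` names the fastest-varying index,
    mirroring Fortran's column-major layout (assumption of this sketch).
    """
    if order == "KIJ":  # like a(k, i, j): k is contiguous
        return k + nk * (i + ni * j)
    if order == "IJK":  # like a(i, j, k): i is contiguous
        return i + ni * (j + nj * k)
    raise ValueError(f"unknown storage order: {order}")


def stride(axis, ni, nj, nk, order):
    """Distance in elements between neighbours along `axis` ('i', 'j' or 'k')."""
    step = {"i": (1, 0, 0), "j": (0, 1, 0), "k": (0, 0, 1)}[axis]
    origin = linear_offset(0, 0, 0, ni, nj, nk, order)
    return linear_offset(*step, ni, nj, nk, order) - origin


if __name__ == "__main__":
    ni, nj, nk = 128, 128, 60  # hypothetical grid size
    for order in ("KIJ", "IJK"):
        print(order,
              "stride along i:", stride("i", ni, nj, nk, order),
              "stride along k:", stride("k", ni, nj, nk, order))
```

With KIJ, a loop over k inside one column walks through contiguous memory, which suits a CPU core working on one column at a time. Neighbouring threads along i are then nk elements apart, so their accesses are not contiguous. With IJK the stride along i is 1, which is the pattern GPUs generally prefer.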

SLIDE 22

SLIDE 23

coarse grained parallelism

SLIDE 24

coarse grained parallelism

GPU unfriendly storage order

SLIDE 25

automate ALL THE THINGS!

SLIDE 26

?

SLIDE 27

SLIDE 28

SLIDE 29

Build System

[Figure: build system flow chart. Elements: h90 Fortran source + directives; python 1, python 2, python 3; xml callgraph + parsed directives; xml callgraph + parsed directives + loop analysis; F90 Fortran (file with CPU + GPU version); make; executable. Also shown: buildtools/Makefile, [projectNdir]/Makefile, MakeSettings, storage_order.F90. Legend: hybrid file, python program, GNU Make, user defined, input, output.]

SLIDE 30

Build System

[Figure: build system flow chart. Elements: h90 Fortran source + directives; python 1, python 2, python 3; xml callgraph + parsed directives; xml callgraph + parsed directives + loop analysis; F90 Fortran (file with CPU + GPU version); make; executable. Also shown: buildtools/Makefile, [projectNdir]/Makefile, MakeSettings, storage_order.F90. Legend: hybrid file, python program, GNU Make, user defined, input, output.]

[Figure: callgraph with the routines calculate_all_columns and sum_column, shown once for the CPU version and once for the GPU version.]
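Note: the CPU and GPU versions of the callgraph can be read as two placements of the same parallel loop. The sketch below is a plain-Python illustration of that reading and is not generated Hybrid Fortran output. It assumes that the CPU version keeps the loop over columns in the caller (coarse grained), while the GPU version moves it into the column routine, where each (i, j) iteration would become one thread. Only the names calculate_all_columns and sum_column come from the slide; the `_cpu`, `_gpu`, `_single` and `_all` suffixes are hypothetical.

```python
def sum_column_single(column):
    """Column routine as called in the coarse-grained (CPU-style) version."""
    total = 0.0
    for value in column:  # sequential loop over the vertical index k
        total += value
    return total


def calculate_all_columns_cpu(field):
    """CPU-style: the parallelisable loop over (i, j) sits in the caller."""
    result = [[0.0] * len(row) for row in field]
    for i, row in enumerate(field):        # would be the threaded loop
        for j, column in enumerate(row):
            result[i][j] = sum_column_single(column)
    return result


def sum_column_all(field, result):
    """GPU-style: the loop over (i, j) sits inside the column routine.

    Each (i, j) iteration stands in for one GPU thread of a kernel.
    """
    for i, row in enumerate(field):
        for j, column in enumerate(row):
            total = 0.0
            for value in column:
                total += value
            result[i][j] = total


def calculate_all_columns_gpu(field):
    """GPU-style caller: a single call, no loop over columns here."""
    result = [[0.0] * len(row) for row in field]
    sum_column_all(field, result)
    return result


if __name__ == "__main__":
    field = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]  # field[i][j][k]
    print(calculate_all_columns_cpu(field))
    print(calculate_all_columns_gpu(field))
```

Both variants compute the same column sums. Only the position of the loop in the call graph differs, and that position is what has to move when going from coarse-grained CPU parallelism to fine-grained GPU kernels.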

SLIDE 31

Build System

[Figure: build system flow chart. Elements: h90 Fortran source + directives; python 1, python 2, python 3; xml callgraph + parsed directives; xml callgraph + parsed directives + loop analysis; F90 Fortran (file with CPU + GPU version); make; executable. Also shown: buildtools/Makefile, [projectNdir]/Makefile, MakeSettings, storage_order.F90. Legend: hybrid file, python program, GNU Make, user defined, input, output.]

SLIDE 32

Mid 2014

ASUCA on Hybrid Fortran

[Figure: ASUCA call graph with each module marked as Ported or Not ported. Modules: arashi; diag long; physics long; physics diag long; Max/Min/Ave output; rungekutta long; diag adjust long; physics adjust long; physics rk long; dynamics rk long; sediment; diagnose rk short; dynamics rk short; monitflux; radiation; convection; pbl/surface; microphys.; makegrid_ideal; ideal; makegrid; prep. Also labelled: RKshort, dtshort, dtlong.]

Tests passed:

  • Rad on CPU, KIJ Order
  • Rad on CPU, IJK Order
  • Gabls3 on CPU, KIJ Order
  • Gabls3 on CPU, IJK Order
  • Warmbubble on CPU, KIJ Order

  • Rad on GPU, KIJ Order
  • Rad on GPU, IJK Order
  • Gabls3 on GPU, KIJ Order
  • Gabls3 on GPU, IJK Order
  • Warmbubble on GPU, KIJ Order

SLIDE 33

Now

ASUCA on Hybrid Fortran

[Figure: ASUCA call graph with each module marked as Ported or Not ported. Modules: arashi; diag long; physics long; physics diag long; Max/Min/Ave output; rungekutta long; diag adjust long; physics adjust long; physics rk long; dynamics rk long; sediment; diagnose rk short; dynamics rk short; monitflux; radiation; convection; pbl/surface; microphys.; makegrid_ideal; ideal; makegrid; prep. Also labelled: RKshort, dtshort, dtlong.]

Tests passed:

  • Rad on CPU, KIJ Order
  • Rad on CPU, IJK Order
  • Gabls3 on CPU, KIJ Order
  • Gabls3 on CPU, IJK Order
  • Warmbubble on CPU, KIJ Order
  • Warmbubble on CPU, IJK Order

  • Rad on GPU, KIJ Order
  • Rad on GPU, IJK Order
  • Gabls3 on GPU, KIJ Order
  • Gabls3 on GPU, IJK Order
  • Warmbubble on GPU, KIJ Order
  • Warmbubble on GPU, IJK Order


SLIDE 34

ASUCA

Hybrid ASUCA Dynamics

  • OpenMP ASUCA Dynamics
  • OpenACC ASUCA Dynamics

Hybrid ASUCA Physics

  • OpenMP ASUCA Physics
  • CUDA Fortran ASUCA Physics

SLIDE 35

ASUCA

Hybrid ASUCA Dynamics (OpenMP ASUCA Dynamics, OpenACC ASUCA Dynamics)

  • nRMS < 1E-9 ✓ ✓
  • 112 Kernels, ~10k LOC, OpenACC
  • advection, HEVI, diagnose, rayleigh damping

Hybrid ASUCA Physics (OpenMP ASUCA Physics, CUDA Fortran ASUCA Physics)

  • nRMS < 1E-9 ✓ ✓
  • 121 Kernels, ~21k LOC, CUDA Fortran
  • Shortwave Radiation, Longwave Radiation, Planetary Boundary Layer, surface

Performance compared to Reference Code on Westmere Xeon: ~1x, ~3.6x

Legend: outside of kernel(s), kernel, inside of kernel, not affected by kernel
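Note: the slides state the validation threshold nRMS < 1E-9 but do not define nRMS. The sketch below assumes one common definition: the root-mean-square difference between a result and the reference, divided by the root-mean-square of the reference. The function name and the sample values are hypothetical.

```python
import math


def normalized_rms_error(values, reference):
    """RMS of (values - reference), normalised by the RMS of reference.

    This is an assumed definition; the deck does not spell out its formula.
    """
    if len(values) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((v - r) ** 2 for v, r in zip(values, reference))
    squared_reference = sum(r ** 2 for r in reference)
    if squared_reference == 0.0:
        raise ValueError("reference must not be all zeros")
    # The 1/N factors of both mean squares cancel in the ratio.
    return math.sqrt(squared_error / squared_reference)


if __name__ == "__main__":
    reference = [280.0, 285.5, 290.25]               # made-up sample values
    ported = [280.0, 285.5 + 1e-8, 290.25]
    error = normalized_rms_error(ported, reference)
    print("nRMS:", error, "passes:", error < 1e-9)
```

A check of this kind would be run per output field, comparing the ported code against the reference implementation.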

SLIDE 36

Further Results

SLIDE 37

[Table: examples (rows) against features (columns)]

Feature columns: # kernels; analytical validation; ref. data validation; directive privatisation; hybrid parallelisation; reduction; stencil access; halo; direct kernel call; array declared in kernel; scalar param. from device arr.; multi kernel routine; strides; array access. func.; || region in branch; early return; impl. scheme; local module data; foreign module data; pointer swap

Examples (# kernels, marks):

  • getting started: 3, ✓ ✓ ✓ ✓
  • 5D vector: 1, ✓ ✓ ✓
  • simple stencil: 1, ✓ ✓ ✓ ✓
  • stencil w/ local array: 1, ✓ ✓ ✓ ✓ ✓
  • scalar passed in: 1, ✓ ✓ ✓ ✓ ✓
  • multi kernel routines: 4, ✓ ✓ ✓ ✓
  • strides: 2, ✓ ✓ ✓ ✓
  • accessor functions: 1, ✓ ✓ ✓ ✓
  • || branches: 2, ✓ ✓ ✓ ✓
  • early returns: 3, ✓ ✓ ✓ ✓ ✓
  • schemes: 4, ✓ ✓ ✓ ✓ ✓
  • module data: 10, ✓ ✓ ✓ ✓
  • 3D diffusion: 4, ✓ ✓ ✓
  • particle push: 1, ✓ ✓
  • midaco solver: 1, ✓ ✓ ✓
  • poisson FEM solver: 2, ✓ ✓ ✓ ✓

SLIDE 38

Branched codebase has partially aged for > 2 years => high code divergence

=> For the production version of the Hybrid code, we basically need to start over

SLIDE 39

Unified

  • Single Fortran code
  • Performant on both CPU and GPU
  • Applicable to both physics and dynamics

  • coarse grained parallelism
  • GPU unfriendly storage order
  • data movement to/from device memory
  • separation device/host code
  • CUDA boilerplate

SLIDE 40

OpenACC

  • coarse grained parallelism: manual conversion
  • GPU unfriendly storage order: manual conversion
  • data movement to/from device memory: 1 directive per data object and routine
  • separation device/host code: kernel / host code in same routine
  • CUDA boilerplate: reduced to single directive per kernel

SLIDE 41

OpenACC vs. Hybrid Fortran Now

  • coarse grained parallelism: manual conversion (OpenACC); directive based conversion (Hybrid Fortran Now)
  • GPU unfriendly storage order: manual conversion (OpenACC); directive based conversion (Hybrid Fortran Now)
  • data movement to/from device memory: 1 directive per data object and routine (OpenACC); 1 directive per data object and routine (Hybrid Fortran Now)
  • separation device/host code: kernel / host code in same routine (OpenACC); kernels / host code must reside in separate routines (Hybrid Fortran Now)
  • CUDA boilerplate: reduced to single directive per kernel (OpenACC); reduced to single directive per kernel (Hybrid Fortran Now)

SLIDE 42

OpenACC vs. Hybrid Fortran Now

  • coarse grained parallelism: manual conversion (OpenACC); directive based conversion (Hybrid Fortran Now)
  • GPU unfriendly storage order: manual conversion (OpenACC); directive based conversion (Hybrid Fortran Now)
  • data movement to/from device memory: 1 directive per data object and routine (OpenACC); 1 directive per data object and routine (Hybrid Fortran Now)
  • separation device/host code: kernel / host code in same routine (OpenACC); kernel / host code in same routine (Hybrid Fortran Now)
  • CUDA boilerplate: reduced to single directive per kernel (OpenACC); reduced to single directive per kernel (Hybrid Fortran Now)

SLIDE 43

OpenACC vs. Hybrid Fortran Now vs. Hybrid Fortran 2017

  • coarse grained parallelism: manual conversion (OpenACC); directive based conversion (Hybrid Fortran Now); automatic conversion, centralized config (Hybrid Fortran 2017)
  • GPU unfriendly storage order: manual conversion (OpenACC); directive based conversion (Hybrid Fortran Now); automatic conversion, centralized config (Hybrid Fortran 2017)
  • data movement to/from device memory: 1 directive per data object and routine (OpenACC); 1 directive per data object and routine (Hybrid Fortran Now); 1 directive per data region (Hybrid Fortran 2017)
  • separation device/host code: kernel / host code in same routine (OpenACC); kernel / host code in same routine (Hybrid Fortran Now); kernel / host code in same routine (Hybrid Fortran 2017)
  • CUDA boilerplate: reduced to single directive per kernel (OpenACC); reduced to single directive per kernel (Hybrid Fortran Now); reduced to single directive per kernel (Hybrid Fortran 2017)

SLIDE 44

  • Parse all imports and specifications
  • Synthesis of subroutines
  • Feature parity between OpenACC and CUDA Fortran backend (except reductions)
  • Passing data between CUDA Fortran and OpenACC implemented kernels
  • Centralize domain configuration, analyse data flow, synthesize data region, privatize data
  • Recognise device code boundaries and update required data accordingly

SLIDE 45

It started as an inline converter ... and is becoming more of a transpiler w/ intermediate representation

SLIDE 46

Questions?

michel@sim.gsic.titech.ac.jp

  • Btw. Hybrid Fortran is free and open source (LGPL license): https://github.com/muellermichel/Hybrid-Fortran