Alpaka: An Abstraction Library for Parallel Kernel Acceleration



SLIDE 1

Alpaka – An Abstraction Library for Parallel Kernel Acceleration

Erik Zenker (1,2), Benjamin Worpitz (1,2), René Widera (1), Axel Huebl (1,2), Guido Juckeland (1), Andreas Knüpfer (2), Wolfgang E. Nagel (2), Michael Bussmann (1)

(1) Helmholtz-Zentrum Dresden-Rossendorf
(2) Technische Universität Dresden

SLIDE 2

Member of the Helmholtz Association

René Widera, Erik Zenker, Guido Juckeland · Computational Radiation Physics · www.hzdr.de/crp { r.widera, e.zenker, g.juckeland }@hzdr.de

Electron Acceleration with Lasers

  • Compact X-Ray sources

Ion Acceleration with Lasers

  • Tumor Therapy

Plasma Instabilities

  • Astrophysics

We Have Lasers: Risk of Forest Fires!

PIConGPU

SLIDE 3

PIConGPU Scales up to 18,432 GPUs

[Figure: weak scaling, efficiency in % versus number of GPUs (1 to 10000), PIConGPU compared with ideal; annotations: efficiency > 95%, 6.9 PFlop/s (SP)]

[Figure: strong scaling, speedup versus number of GPUs (1 to 10000), compared with ideal, for the ranges 1 to 32, 8 to 256, 64 to 2048, 512 to 16384, and 4096 to 16384 GPUs]

SLIDE 4

Alpaka

SLIDE 5

Good News: There are Alpakas on the Compute Meadow

[Figure labels: Fibers, C++, Threads, TBB]

  • Single zero-overhead interface to existing parallelism models
  • Single-source C++11 kernels
  • Data-agnostic memory model

SLIDE 6

A Uniform Interface to All These Programming Models?

C++ compilers are able to remove abstraction layers almost completely.

  • Interface is defined by a set of free functions with template arguments
  • Template arguments need to fulfill type requirements (concepts)
  • Interface is extendable through more concept implementations (models)

[Figure: application code on top of a zero-overhead abstraction interface, which sits on top of Lib1, Lib2, ..., Libn]
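The idea of an interface made of free functions with template arguments can be sketched in plain C++11. This is only an illustration of the pattern, not Alpaka's actual interface: the back-end tags `SerialBackend` and `UnrolledBackend`, the trait `ForEachIndex`, and the functions `forEachIndex` and `squares` are hypothetical names chosen for this sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical back-end tags: two "models" of an illustrative concept.
struct SerialBackend {};
struct UnrolledBackend {};

// Trait each model has to specialize. The primary template is left
// undefined, so using an unsupported type fails at compile time.
template <typename TBackend>
struct ForEachIndex;

template <>
struct ForEachIndex<SerialBackend> {
    template <typename F>
    static void run(std::size_t n, F&& f) {
        for (std::size_t i = 0; i < n; ++i) f(i);
    }
};

template <>
struct ForEachIndex<UnrolledBackend> {
    template <typename F>
    static void run(std::size_t n, F&& f) {
        std::size_t i = 0;
        for (; i + 1 < n; i += 2) { f(i); f(i + 1); }  // two indices per step
        if (i < n) f(i);                                // remainder for odd n
    }
};

// User-facing interface: one free function; the back-end is a template
// argument, so the dispatch is resolved at compile time.
template <typename TBackend, typename F>
void forEachIndex(std::size_t n, F&& f) {
    ForEachIndex<TBackend>::run(n, f);
}

// Application code written once against the interface.
template <typename TBackend>
std::vector<int> squares(std::size_t n) {
    std::vector<int> out(n);
    forEachIndex<TBackend>(n, [&](std::size_t i) {
        out[i] = static_cast<int>(i * i);
    });
    return out;
}
```

Calling `squares<SerialBackend>(5)` and `squares<UnrolledBackend>(5)` runs the same application code on both models. Adding a further back-end only requires another specialization of the trait.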

SLIDE 7

Heterogeneous Codes Need to be Maintainable Single Source

  • Heterogeneity: Write once, execute everywhere
  • Testability: Validate once, get correct results everywhere
  • Sustainability: Porting implies minimal code changes
  • Optimizability: Tune for good performance at minimum coding effort
  • Openness: Open source and open standards

SLIDE 8

Abstract Hierarchical Redundant Parallelism Model

[Figure: hierarchy level Grid; legend: Synchronize, Parallel, Sequential]

SLIDE 9

Abstract Hierarchical Redundant Parallelism Model

[Figure: hierarchy levels Grid, Block; legend: Synchronize, Parallel, Sequential]

SLIDE 10

Abstract Hierarchical Redundant Parallelism Model

[Figure: hierarchy levels Grid, Block, Thread; legend: Synchronize, Parallel, Sequential]

SLIDE 11

Abstract Hierarchical Redundant Parallelism Model

[Figure: hierarchy levels Grid, Block, Thread, Element; legend: Synchronize, Parallel, Sequential]

  • Element level is an explicit sequential layer
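The four levels can be traced with a serial emulation in plain C++. This is a minimal sketch in one dimension and does not use Alpaka: `WorkDivision` and `visitOrder` are hypothetical names, and the block and thread loops, which an accelerator may run in parallel, are ordinary loops here. Only the innermost element loop is sequential by definition in the model.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 1D work division; the names are illustrative only.
struct WorkDivision {
    std::size_t blocksPerGrid;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Walks the hierarchy grid -> block -> thread -> element and returns the
// element indices in the order visited, clipped to numElements.
inline std::vector<std::size_t> visitOrder(WorkDivision const& wd,
                                           std::size_t numElements) {
    std::vector<std::size_t> order;
    // Grid level: iterate over blocks.
    for (std::size_t block = 0; block < wd.blocksPerGrid; ++block) {
        // Block level: iterate over the threads of this block.
        for (std::size_t thread = 0; thread < wd.threadsPerBlock; ++thread) {
            std::size_t const globalThread = block * wd.threadsPerBlock + thread;
            std::size_t const begin = globalThread * wd.elementsPerThread;
            // Element level: explicit sequential loop inside one thread.
            for (std::size_t e = 0; e < wd.elementsPerThread; ++e) {
                std::size_t const i = begin + e;
                if (i < numElements) order.push_back(i);
            }
        }
    }
    return order;
}
```

With 2 blocks, 2 threads per block and 3 elements per thread, the hierarchy spans 12 element slots; for 10 elements the last thread processes only one of its three slots.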

SLIDE 12

Data Structure Agnostic Memory Model

[Figure: Host Memory and Device linked by an explicit deep copy; device side shows Global Memory at the Grid level]

SLIDE 13

Data Structure Agnostic Memory Model

[Figure: Host Memory and Device linked by an explicit deep copy; device side shows Global Memory at the Grid level and Shared Memory at the Block level]

SLIDE 14

Data Structure Agnostic Memory Model

[Figure: Host Memory and Device linked by an explicit deep copy; device side shows Global Memory at the Grid level, Shared Memory at the Block level, and Register Memory at the Thread level]

SLIDE 15

Map the Abstraction Model to your Desired Acceleration Back-End

  • Explicit mapping of parallelization levels to hardware

[Figure: CPU diagram with RAM, two packages with L3 caches, cores with L1/2 caches and registers (R), and AVX units]

SLIDE 16

Map the Abstraction Model to your Desired Acceleration Back-End

  • Explicit mapping of parallelization levels to hardware

[Figure: CPU diagram (RAM, two packages with L3 caches, cores with L1/2 caches and registers, AVX units) annotated with the model levels Grid and Global Memory]

SLIDE 17

Map the Abstraction Model to your Desired Acceleration Back-End

  • Explicit mapping of parallelization levels to hardware

[Figure: CPU diagram (RAM, two packages with L3 caches, cores with L1/2 caches and registers, AVX units) annotated with the model levels Grid, Block, Shared Memory, and Global Memory]

SLIDE 18

Map the Abstraction Model to your Desired Acceleration Back-End

  • Explicit mapping of parallelization levels to hardware

[Figure: CPU diagram (RAM, two packages with L3 caches, cores with L1/2 caches and registers, AVX units) annotated with the model levels Grid, Block, Thread, Register Memory, Shared Memory, and Global Memory]

SLIDE 19

Map the Abstraction Model to your Desired Acceleration Back-End

  • Explicit mapping of parallelization levels to hardware

[Figure: CPU diagram (RAM, two packages with L3 caches, cores with L1/2 caches and registers, AVX units) annotated with the model levels Grid, Block, Thread, Element, Register Memory, Shared Memory, and Global Memory]

SLIDE 20

Map the Abstraction Model to your Desired Acceleration Back-End

  • Levels of the model that a back-end does not support can be ignored
  • The abstract interface allows the set of mappings to be extended
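One way to picture an ignored level is a work division in which that level has extent 1. The following sketch is not Alpaka code; `WorkDivision`, `divideWork`, `gpuLike` and `serialCpuLike` are hypothetical names, and the extents 256 and 1 are arbitrary choices for illustration. It assumes a serial CPU back-end that runs one thread per block and moves the work to the element level, while a GPU-like back-end uses many threads with one element each.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 1D work division; the names are illustrative only.
struct WorkDivision {
    std::size_t blocksPerGrid;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Chooses the number of blocks so that all elements are covered
// (rounding up when numElements is not a multiple of the block capacity).
inline WorkDivision divideWork(std::size_t numElements,
                               std::size_t threadsPerBlock,
                               std::size_t elementsPerThread) {
    assert(threadsPerBlock > 0 && elementsPerThread > 0);
    std::size_t const perBlock = threadsPerBlock * elementsPerThread;
    std::size_t const blocks = (numElements + perBlock - 1) / perBlock;
    return WorkDivision{blocks, threadsPerBlock, elementsPerThread};
}

// GPU-like mapping (assumption): many threads per block, one element each.
inline WorkDivision gpuLike(std::size_t n) { return divideWork(n, 256, 1); }

// Serial CPU mapping (assumption): the thread level is ignored (extent 1)
// and each block loops sequentially over its elements.
inline WorkDivision serialCpuLike(std::size_t n) { return divideWork(n, 1, 256); }
```

Both mappings cover 1000 elements with 4 blocks; they differ only in whether the 256 slots of a block are threads or sequential elements.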

SLIDE 21

Alpaka: Vector Addition Kernel

    namespace alp = alpaka;

    struct VectorAdd {
        template<typename TAcc, typename TElem, typename TSize>
        ALPAKA_FN_ACC auto operator()(
            TAcc const & acc,
            TSize const & numElements,
            TElem const * const X,
            TElem * const Y) const -> void
        {
            // Index of this thread within the whole grid
            auto globalIdx = alp::idx::getIdx<alp::Grid, alp::Threads>(acc)[0u];
            // Number of elements each thread processes sequentially
            auto elemsPerThread = alp::workdiv::getWorkDiv<alp::Thread, alp::Elems>(acc)[0u];

            // Range of elements owned by this thread, clipped to the vector length
            auto begin = globalIdx * elemsPerThread;
            auto end = min(begin + elemsPerThread, numElements);

            for(TSize i = begin; i < end; ++i){
                Y[i] = X[i] + Y[i];
            }
        }
    };
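The index arithmetic of the kernel can be checked without Alpaka. In this minimal sketch the thread index and the number of elements per thread are passed in as plain arguments instead of being queried from the accelerator object; `vectorAddThread` and `launchSerial` are hypothetical helper names, and the "launch" is just a serial loop over thread indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Work of one thread: the same begin/end computation as in the kernel,
// with the indices supplied by the caller.
inline void vectorAddThread(std::size_t globalIdx, std::size_t elemsPerThread,
                            std::size_t numElements,
                            float const* X, float* Y) {
    std::size_t const begin = globalIdx * elemsPerThread;
    std::size_t const end = std::min(begin + elemsPerThread, numElements);
    // If begin >= numElements the loop body never runs.
    for (std::size_t i = begin; i < end; ++i) {
        Y[i] = X[i] + Y[i];
    }
}

// Serial stand-in for a kernel launch: calls the thread body once per
// global thread index.
inline void launchSerial(std::size_t numThreads, std::size_t elemsPerThread,
                         std::vector<float> const& X, std::vector<float>& Y) {
    assert(X.size() >= Y.size());
    for (std::size_t t = 0; t < numThreads; ++t) {
        vectorAddThread(t, elemsPerThread, Y.size(), X.data(), Y.data());
    }
}
```

For five elements, two threads with three elements each are enough: thread 0 handles indices 0 to 2, and thread 1 handles indices 3 and 4 because its range is clipped at the vector length.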

SLIDE 22

Alpaka: Initialization

    // Configure Alpaka
    using Dim     = alpaka::dim::DimInt<3u>;
    using Size    = std::size_t;
    using Acc     = alpaka::acc::AccCpuSerial<Dim, Size>;
    using Host    = alpaka::acc::AccCpuSerial<Dim, Size>;
    using Stream  = alpaka::stream::StreamCpuSync;
    using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Size>;
    using Elem    = float;

    // Retrieve devices and stream
    // (DevHost and DevAcc are the device types of Host and Acc;
    //  their declarations are not shown on the slide)
    DevHost devHost ( alpaka::dev::DevMan<Host>::getDevByIdx(0) );
    DevAcc  devAcc  ( alpaka::dev::DevMan<Acc>::getDevByIdx(0) );
    Stream  stream  ( devAcc );

    // Specify work division
    auto elementsPerThread ( alpaka::Vec<Dim, Size>::ones() );
    auto threadsPerBlock   ( alpaka::Vec<Dim, Size>::all(2u) );
    auto blocksPerGrid     ( alpaka::Vec<Dim, Size>(4u, 8u, 16u) );
    WorkDiv workDiv(alpaka::workdiv::WorkDivMembers<Dim, Size>(
        blocksPerGrid, threadsPerBlock, elementsPerThread));
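The size of the index space described by this work division follows from multiplying the extents. The sketch below does the arithmetic in plain C++; `Extent3`, `product` and `totalElements` are hypothetical helpers, not part of Alpaka. With 4 x 8 x 16 = 512 blocks, 2 x 2 x 2 = 8 threads per block and one element per thread, the division covers 512 x 8 x 1 = 4096 elements.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 3D extent type for this sketch.
using Extent3 = std::array<std::size_t, 3>;

// Number of items described by a 3D extent.
inline std::size_t product(Extent3 const& e) {
    return e[0] * e[1] * e[2];
}

// Total number of elements covered by a work division:
// blocks in the grid x threads per block x elements per thread.
inline std::size_t totalElements(Extent3 const& blocksPerGrid,
                                 Extent3 const& threadsPerBlock,
                                 Extent3 const& elementsPerThread) {
    return product(blocksPerGrid) * product(threadsPerBlock)
         * product(elementsPerThread);
}
```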

SLIDE 23

Alpaka: Call the Kernel

    // Memory allocation and host to device memory copy
    // (extent and numElements are assumed to be defined earlier;
    //  they are not shown on the slide)
    auto X_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
    auto Y_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
    auto X_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);
    auto Y_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);
    alpaka::mem::view::copy(stream, X_d, X_h, extent);
    alpaka::mem::view::copy(stream, Y_d, Y_h, extent);

    // Kernel creation and execution
    VectorAdd kernel;
    auto const exec(alpaka::exec::create<Acc>(
        workDiv,
        kernel,
        numElements,
        alpaka::mem::view::getPtrNative(X_d),
        alpaka::mem::view::getPtrNative(Y_d)));
    alpaka::stream::enqueue(stream, exec);

    // Copy memory back to host
    alpaka::mem::view::copy(stream, Y_h, Y_d, extent);

SLIDE 24

Compiles to Almost the Same PTX Code (DAXPY)

SLIDE 25

Single Source Alpaka DGEMM Kernel on Various Architectures

SLIDE 26

PIConGPU Efficiency on Various Architectures

SLIDE 27

Clone us from GitHub

https://github.com/ComputationalRadiationPhysics

git clone https://github.com/ComputationalRadiationPhysics/alpaka
