

SLIDE 1

Parallelism Inherent in the Wavefront Algorithm

Gavin J. Pringle

SLIDE 2

The Benchmark code

Particle transport code using wavefront algorithm

Primarily used for benchmarking

Coded in Fortran 90 and MPI

Scales to thousands of cores for large problems

Over 90% of time in one kernel at the heart of the computation

SLIDE 3

Serial Algorithm Outline

Outer iteration
  Loop over energy groups
    Inner iteration
      Loop over sweeps
        Loop over cells in z direction
          Loop over cells in y direction
            Loop over cells in x direction
              Loop over angles (only independent loop!)
                work (90% of time spent here)
              End loop over angles
            End loop over cells in x direction
          End loop over cells in y direction
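
As a rough Fortran sketch of this loop nest (all bounds, names, and the "work" body are illustrative placeholders, not the benchmark's own code):

program serial_sweep
  implicit none
  integer, parameter :: max_outer = 2, max_inner = 2
  integer, parameter :: ngroups = 2, nsweeps = 8
  integer, parameter :: nx = 6, ny = 6, nz = 6, nang = 4
  integer :: outer, g, inner, sweep, i, j, k, m
  real    :: total

  total = 0.0
  do outer = 1, max_outer                 ! outer iteration
    do g = 1, ngroups                     ! loop over energy groups
      do inner = 1, max_inner             ! inner iteration
        do sweep = 1, nsweeps             ! loop over sweeps (one per octant)
          do k = 1, nz                    ! cells in z direction
            do j = 1, ny                  ! cells in y direction
              do i = 1, nx                ! cells in x direction
                do m = 1, nang            ! angles: the only independent loop
                  total = total + 1.0     ! placeholder "work": ~90% of run time
                end do
              end do
            end do
          end do
        end do
      end do
    end do
  end do
  print *, 'cell-angle updates:', total
end program serial_sweep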

SLIDE 4

Close up of parallelised loops over cells

Loop over cells in z direction
  Possible MPI_Recv communications
  Loop over cells in y direction
    Loop over cells in x direction
      Loop over angles (number of angles too small for MPI)
        work
      End loop over angles
    End loop over cells in x direction
  End loop over cells in y direction
  Possible MPI_Ssend communications
End loop over cells in z direction
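
A hedged Fortran/MPI sketch of the same structure, assuming a 2D Cartesian decomposition in which upstream data arrives from the top and right neighbours and results leave towards the bottom and left (as labelled on the later dependency diagram); buffer names, tags and the placeholder "work" are not taken from the benchmark:

program mpi_sweep_sketch
  use mpi
  implicit none
  integer, parameter :: nxl = 4, nyl = 4, nz = 4, nang = 4
  integer :: comm2d, nprocs, ierr, k, j, i, m
  integer :: from_top, to_bottom, from_right, to_left
  integer :: dims(2)
  logical :: periods(2)
  integer :: status(MPI_STATUS_SIZE)
  real    :: top_edge(nxl), right_edge(nyl), cell

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  dims = 0
  periods = .false.
  call MPI_Dims_create(nprocs, 2, dims, ierr)
  call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .false., comm2d, ierr)
  ! Upstream neighbours are "top" and "right"; downstream are "bottom" and "left"
  call MPI_Cart_shift(comm2d, 0, 1, from_top, to_bottom, ierr)
  call MPI_Cart_shift(comm2d, 1, 1, to_left, from_right, ierr)

  top_edge = 0.0
  right_edge = 0.0
  cell = 0.0

  do k = 1, nz                                  ! loop over cells in z direction
     ! Possible MPI_Recv communications (only if an upstream neighbour exists)
     if (from_top /= MPI_PROC_NULL) then
        call MPI_Recv(top_edge, nxl, MPI_REAL, from_top, 1, comm2d, status, ierr)
     end if
     if (from_right /= MPI_PROC_NULL) then
        call MPI_Recv(right_edge, nyl, MPI_REAL, from_right, 2, comm2d, status, ierr)
     end if

     do j = 1, nyl                              ! loop over cells in y direction
        do i = 1, nxl                           ! loop over cells in x direction
           do m = 1, nang                       ! loop over angles (too few for MPI)
              cell = top_edge(i) + right_edge(j) + 1.0   ! placeholder "work"
           end do
           top_edge(i)   = cell                 ! becomes upstream data for the
           right_edge(j) = cell                 ! neighbours below and to the left
        end do
     end do

     ! Possible MPI_Ssend communications to the downstream neighbours
     if (to_bottom /= MPI_PROC_NULL) then
        call MPI_Ssend(top_edge, nxl, MPI_REAL, to_bottom, 1, comm2d, ierr)
     end if
     if (to_left /= MPI_PROC_NULL) then
        call MPI_Ssend(right_edge, nyl, MPI_REAL, to_left, 2, comm2d, ierr)
     end if
  end do

  call MPI_Finalize(ierr)
end program mpi_sweep_sketch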

SLIDE 5

  • The MPI decomposition is a 2D decomposition of the front x-y face.

  • Figure shows 4 MPI tasks.

[Figure: the x-y face divided among 4 MPI tasks; axis labels j, k, l]

SLIDE 6

Diagram of dependencies

This diagram shows the domain of one MPI task. A cell cannot be processed until all of the cells it depends on have been processed.

[Figure labels: MPI data FromTop, MPI data FromRight, MPI data ToBottom, MPI data ToLeft]

SLIDE 7

Sweep order: 3D diagonal slices

Cells of the same colour are independent and may be processed in parallel, once the preceding slices are complete.

[Figure labels: MPI data FromTop, MPI data FromRight, MPI data ToBottom, MPI data ToLeft]

SLIDE 8

Slice shapes (6x6x6)

  • Increasing triangles
  • Then transforming hexagons
  • Then decreasing (flipped) triangles
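
Assuming a slice is the set of cells with a constant index sum i + j + k (a common wavefront construction, not necessarily the benchmark's exact definition), a 6x6x6 grid has 3n - 2 = 16 diagonal slices, which the sketch below enumerates and which slides 9 to 24 show one at a time:

program slice_shapes
  implicit none
  integer, parameter :: n = 6
  integer :: s, i, j, k, ncells

  do s = 1, 3*n - 2                 ! slice number (1 = nearest the viewer)
     ncells = 0
     do k = 1, n
        do j = 1, n
           do i = 1, n
              ! cell (i,j,k) lies on slice s when its index sum matches
              if (i + j + k - 2 == s) ncells = ncells + 1
           end do
        end do
     end do
     ! slice sizes grow (triangles), pass through hexagons, then shrink
     print '(a,i2,a,i3,a)', 'slice ', s, ' contains ', ncells, ' cells'
  end do
end program slice_shapes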

SLIDE 9

Slice 1

Cell nearest the viewer

SLIDE 10

Slice 2

Moving down away from viewer

SLIDE 11

Slice 3

SLIDE 12

Slice 4

SLIDE 13

Slice 5

SLIDE 14

Slice 6

SLIDE 15

Slice 7

SLIDE 16

Slice 8

SLIDE 17

Slice 9

SLIDE 18

Slice 10

SLIDE 19

Slice 11

SLIDE 20

Slice 12

SLIDE 21

Slice 13

SLIDE 22

Slice 14

SLIDE 23

Slice 15

SLIDE 24

Slice 16

Point furthest from viewer

SLIDE 25

Close up of parallelised loops over cells using MPI

Loop over cells in z direction
  Possible MPI_Recv communications
  Loop over cells in y direction
    Loop over cells in x direction
      Loop over angles (number of angles too small for MPI)
        work
      End loop over angles
    End loop over cells in x direction
  End loop over cells in y direction
  Possible MPI_Ssend communications
End loop over cells in z direction

SLIDE 26

Close up of parallelised loops over cells using MPI and OpenMP

Loop over slices
  Possible MPI_Recv communications
  OMP PARALLEL DO
  Loop over cells in each slice
    OMP PARALLEL DO
    Loop over angles
      work
    End loop over angles
    OMP END PARALLEL DO
  End loop over cells in each slice
  OMP END PARALLEL DO
  Possible MPI_Ssend communications
End loop over slices
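
A hedged hybrid sketch under the same slice assumption as before; MPI calls are indicated only by comments here, and for brevity only the cell loop carries an OpenMP directive, whereas the slide parallelises the angle loop as well:

program hybrid_sweep_sketch
  implicit none
  integer, parameter :: n = 6, nang = 4, nslices = 3*n - 2
  integer :: s, c, m, i, j, k, ncells
  integer :: cell_i(n*n), cell_j(n*n), cell_k(n*n)
  real    :: total

  total = 0.0
  do s = 1, nslices                     ! loop over slices
     ! Possible MPI_Recv communications would sit here (see the MPI sketch earlier)

     ! Collect the cells of slice s: those with constant index sum i+j+k
     ncells = 0
     do k = 1, n
        do j = 1, n
           do i = 1, n
              if (i + j + k - 2 == s) then
                 ncells = ncells + 1
                 cell_i(ncells) = i
                 cell_j(ncells) = j
                 cell_k(ncells) = k
              end if
           end do
        end do
     end do

     ! Cells within a slice are independent, so the cell loop can be
     ! shared among OpenMP threads
     !$OMP PARALLEL DO PRIVATE(m, i, j, k) REDUCTION(+:total)
     do c = 1, ncells
        i = cell_i(c); j = cell_j(c); k = cell_k(c)
        do m = 1, nang                            ! loop over angles
           total = total + real(i + j + k + m)    ! placeholder "work" on cell (i,j,k)
        end do
     end do
     !$OMP END PARALLEL DO

     ! Possible MPI_Ssend communications would sit here
  end do

  print *, 'cell-angle updates:', total
end program hybrid_sweep_sketch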

SLIDE 27

Parallel Algorithm Outline

Outer iteration
  Loop over energy groups
    Inner iteration
      Loop over sweeps
        Loop over slices
          Possible MPI_Recv communications
          OMP PARALLEL DO
          Loop over cells in each slice
            OMP PARALLEL DO
            Loop over angles
              work
            End loop over angles
            Etc.

SLIDE 28

Decoupling inter-dependent energy group calculations

Initially, each energy group calculation used the previous energy group's results as input. Decoupling the energy groups has two outcomes:

  • Execution time is greatly increased
  • Energy groups are now independent and can be parallelised

Often seen in HPC:

  • Modern algorithms can be inherently serial
  • An older version may be parallelisable
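
To make the dependence concrete, here is a toy Fortran sketch of the two forms; the update formula is invented purely for illustration and is not the benchmark's physics:

program decouple_sketch
  implicit none
  integer, parameter :: ngroups = 4
  real :: phi_old(ngroups), phi_coupled(ngroups), phi_decoupled(ngroups)
  integer :: g

  phi_old = 1.0

  ! Coupled: group g reads the freshly computed group g-1, so the
  ! loop must run in order (inherently serial over groups)
  phi_coupled(1) = 0.5 * phi_old(1)
  do g = 2, ngroups
     phi_coupled(g) = 0.5 * (phi_old(g) + phi_coupled(g-1))
  end do

  ! Decoupled: every group reads only the previous iteration's values,
  ! so the iterations are independent and can be farmed out to MPI
  ! tasks, at the cost of more outer iterations to converge
  do g = 1, ngroups
     phi_decoupled(g) = 0.5 * (phi_old(g) + phi_old(max(g-1, 1)))
  end do

  print *, phi_coupled
  print *, phi_decoupled
end program decouple_sketch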

SLIDE 29

Task Farm Summary

If all the tasks take the same time to compute
  Block distribution of tasks
  Cyclic distribution of tasks
else if all tasks have different execution times
  If the lengths of tasks are unknown in advance
    Cyclic distribution of tasks
  else
    Order tasks: longest first, shortest last
    Cyclic distribution of tasks
  endif
endif
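
As a small, self-contained illustration of the block and cyclic distributions named above (task and worker counts are arbitrary):

program distribution_sketch
  implicit none
  integer, parameter :: ntasks = 10, nworkers = 3
  integer :: t, block_size, owner_block, owner_cyclic

  block_size = (ntasks + nworkers - 1) / nworkers   ! ceiling division
  do t = 1, ntasks
     owner_block  = (t - 1) / block_size            ! block: consecutive tasks per worker
     owner_cyclic = mod(t - 1, nworkers)            ! cyclic: deal tasks out in turn
     print '(a,i3,a,i2,a,i2)', 'task', t, '  block owner:', owner_block, &
           '  cyclic owner:', owner_cyclic
  end do
end program distribution_sketch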

SLIDE 30

Final Parallel Algorithm Outline

Outer iteration
  MPI Task Farm of energy groups
    Inner iteration
      Loop over sweeps
        Loop over slices
          Possible MPI_Recv communications
          OMP PARALLEL DO
          Loop over cells in each slice
            OMP PARALLEL DO
            Loop over angles
              work
            End loop over angles
            Etc.

SLIDE 31

Conclusion

  • Other wavefront codes have the loops in a different order
  • The loop over energy groups can occur within the loops over cells, and might be parallelised with OpenMP
  • The energy groups must be decoupled first

SLIDE 32

Thank you

Any questions?

gavin@epcc.ed.ac.uk