

SLIDE 1

Hints to improve automatic load balancing with LeWI for hybrid applications

Marta Garcia, Jesus Labarta, Julita Corbalan Journal of Parallel and Distributed Computing — Volume 74, Issue 9 September 2014

1 / 27

SLIDE 2

Motivation

Loss of efficiency
Hybrid programming models (MPI + X)
Manual tuning of parallel codes (load balancing, data redistribution)

2 / 27

SLIDE 3

The X (in this paper)

OpenMP
  Directives to annotate parallel code
  Fork/join model with shared memory
  Number of threads may change between parallel regions
SMPSs (SMP Superscalar)
  Task as the basic element
  Annotate taskifiable functions and their parameters (in/out/inout)
  Task graph to track dependencies
  Number of threads may change at any time
(see the annotation sketch below)
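A minimal sketch to make the contrast between the two annotation styles concrete. The loop, array names, and block size are invented for illustration; the SMPSs directive follows the "#pragma css task" convention of SMP Superscalar as I recall it, so treat it as an approximation rather than the exact syntax used in the paper's codes.

```c
#define N  1024
#define BS 128

/* OpenMP: fork/join model; the number of threads is fixed while a
 * parallel region runs and can only change between regions. */
void scale_openmp(double *v, long n, double factor)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        v[i] *= factor;
}

/* SMPSs: work is expressed as tasks on annotated functions; the runtime
 * builds a dependency graph from the in/out/inout parameter directions,
 * and the number of worker threads may change at any time.
 * (Directive written from the SMP Superscalar "#pragma css task"
 * convention; a plain C compiler simply ignores the unknown pragma.) */
#pragma css task inout(block)
void scale_block(double block[BS], double factor)
{
    for (int i = 0; i < BS; i++)
        block[i] *= factor;
}

void scale_smpss(double *v, double factor)
{
    for (long i = 0; i < N; i += BS)
        scale_block(&v[i], factor);   /* each call becomes a task */
}
```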

3 / 27

SLIDE 4

DLB and LeWI

DLB (Dynamic Load Balancing)
  "Runtime interposition to [...] intercept MPI calls"
  Balances the load on the inner level (OpenMP/SMPSs)
  Several load balancing algorithms
LeWI (Lend When Idle)
  CPUs of a rank blocked in an MPI call are idle
  Lend those CPUs to other ranks on the same node; recover them after the MPI call completes
(see the interposition sketch below)
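This is not the DLB implementation, only a hedged sketch of the interposition idea using the standard PMPI profiling interface plus OpenMP's thread-count API. The helpers lend_cpus/reclaim_cpus are placeholders for what the real runtime coordinates through shared memory on the node; note that with OpenMP the new thread count only takes effect at the next parallel region, which is exactly the malleability limitation discussed later.

```c
#include <mpi.h>
#include <omp.h>

/* Placeholder hooks: a real runtime such as DLB would track through shared
 * memory which CPUs of the node are idle and which rank may use them. */
static void lend_cpus(void)    { omp_set_num_threads(1); }
static void reclaim_cpus(void) { omp_set_num_threads(omp_get_num_procs()); }

/* Interposed wrapper: MPI defines the PMPI_ profiling entry points, so the
 * application's MPI_Recv can be intercepted without touching its code. */
int MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    lend_cpus();                    /* this rank is about to block: CPUs idle */
    int err = PMPI_Recv(buf, count, type, src, tag, comm, status);
    reclaim_cpus();                 /* call finished: take the CPUs back */
    return err;
}
```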

4 / 27

SLIDE 5

LeWI

(a) No load balancing. (b) LeWI algorithm with SMPSs. (c) LeWI algorithm with OpenMP.

5 / 27

SLIDE 6

Approach

“Extensive performance evaluation”
“Modeling parallelization characteristics that limit the automatic load balancing potential”
“Improving automatic load balancing”

6 / 27

SLIDE 7

Performance evaluation

MareNostrum 2 nodes: 2 × IBM PowerPC 970MP (2 cores each); 8 GiB RAM
Linux 2.6.5-7.244-pseries64; MPICH; IBM XL C/C++ compiler w/o optimizations

Metrics

Speedup = serial_execution_time / parallel_execution_time

Efficiency = useful_cpu_time / (elapsed_time ∗ cpus)

where
  useful_cpu_time = cpu_time − (mpi_time + openmp/smpss_time + dlb_time)
  cpus = CPUs used to simultaneously run application code
(small helper below)

3 benchmarks + 2 real applications
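Just to make the two formulas concrete, a small helper using the slide's variable names; the timing inputs are assumed to come from tracing or instrumentation and are not part of the original slides.

```c
/* Speedup and efficiency as defined on this slide.
 * All times are in seconds; cpus is the number of CPUs available to run
 * application code simultaneously. */
double speedup(double serial_execution_time, double parallel_execution_time)
{
    return serial_execution_time / parallel_execution_time;
}

double efficiency(double cpu_time, double mpi_time, double runtime_time,
                  double dlb_time, double elapsed_time, int cpus)
{
    /* runtime_time covers the OpenMP/SMPSs runtime overhead */
    double useful_cpu_time = cpu_time - (mpi_time + runtime_time + dlb_time);
    return useful_cpu_time / (elapsed_time * cpus);
}
```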

7 / 27

SLIDE 8

PILS (Parallel ImbaLance Simulation)

Synthetic benchmark
Core: "floating point operations without data involved"
Tunable parameters
  Programming model (MPI, MPI + OpenMP, MPI + SMPSs)
  Load distribution
  Parallelism grain (= 1 / #parallel regions)
  Iterations
(see the sketch below)
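Not the actual PILS source; a minimal MPI + OpenMP sketch of the same idea under the assumptions noted in the comments: each rank burns a rank-dependent amount of floating-point work (the load distribution), split across nregions parallel regions so the parallelism grain (1 / #parallel regions) can be varied.

```c
#include <mpi.h>
#include <omp.h>
#include <math.h>
#include <stdio.h>

/* Pure floating-point work with no data involved. */
static double burn(long flops)
{
    double x = 0.0;
    for (long i = 0; i < flops; i++)
        x += sin((double)i) * 1e-9;
    return x;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int  iterations = 10;
    int  nregions   = 4;                  /* parallelism grain = 1 / nregions */
    long base       = 50000000L;
    long my_load    = base * (rank + 1);  /* imbalanced load distribution */

    double sink = 0.0;
    for (int it = 0; it < iterations; it++) {
        /* The iteration's work is split into nregions parallel regions; a
         * finer grain gives the runtime more chances to use lent CPUs. */
        for (int r = 0; r < nregions; r++) {
            #pragma omp parallel reduction(+:sink)
            sink += burn(my_load / nregions / omp_get_num_threads());
        }
        MPI_Barrier(MPI_COMM_WORLD);      /* fast ranks idle here: imbalance */
    }

    if (rank == 0)
        printf("checksum %f\n", sink);
    MPI_Finalize();
    return 0;
}
```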

8 / 27

SLIDE 9

PILS

9 / 27

SLIDE 10

Parallelism Grain

10 / 27

SLIDE 11

Other Codes

Benchmarks

BT-MZ: block tri-diagonal solver
LUB: LU matrix factorization

Applications

Gromacs: molecular dynamics, MPI-only
Gadget: cosmological N-body/SPH (smoothed-particle hydrodynamics) simulation

11 / 27

SLIDE 12

Other Codes

Benchmarks

BT-MZ: block tri-diagonal solver
LUB: LU matrix factorization

Applications

Gromacs: molecular dynamics, MPI-only
Gadget: cosmological N-body/SPH (smoothed-particle hydrodynamics) simulation

Application | Original version          | MPI + OpenMP | MPI + SMPSs | Executed in nodes (cpus)
PILS        | MPI + OpenMP, MPI + SMPSs | X            | X           | 1 (4)
BT-MZ       | MPI + OpenMP              | X            | X           | 1, 2, 4 (4, 8, 16)
LUB         | MPI + OpenMP, MPI + SMPSs | X            | X           | 1, 2, 4 (4, 8, 16)
Gromacs     | MPI                       |              | X           | 1–64 (4–256)
Gadget      | MPI                       |              | X           | 200 (800)

11 / 27

SLIDE 13

PILS, 2 and 4 MPI processes

12 / 27

SLIDE 14

BT-MZ; 1 node

13 / 27

SLIDE 15

BT-MZ; 2, 4 nodes; Class C

14 / 27

SLIDE 16

BT-MZ; 1 node; 4 MPI processes

15 / 27

SLIDE 17

LUB; 1 node; Block size 200


16 / 27

SLIDE 18

Gromacs; 1–64 nodes + Details for 16 nodes

17 / 27

SLIDE 19

Gromacs; Efficiency + CPUs used per Node

18 / 27

SLIDE 20

Gadget; 200 nodes

19 / 27

SLIDE 21

Factors Limiting Performance Improvement with LeWI

“Parallelism Grain in OpenMP applications”
“Task duration in SMPSs applications”
“Distribution of MPI processes among computation nodes”
(grain illustration below)
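A hedged illustration of the first factor: with OpenMP, CPUs lent by another rank can only join at the start of a parallel region, so chopping one long region into several shorter ones (a finer parallelism grain, as the modified LUB on the following slides does at block level) lets lent CPUs actually be used. Function names and the loop body are illustrative, not the LUB source.

```c
/* Coarse grain: one parallel region per phase. A CPU lent to this rank
 * while the region is already running cannot join it. */
void phase_coarse(double *a, long n)
{
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; i++)
        a[i] = a[i] * a[i] + 1.0;
}

/* Finer grain: the same work split into several shorter parallel regions.
 * Each region start is a point where the OpenMP runtime re-reads the
 * thread count, so CPUs lent (or reclaimed) by LeWI take effect sooner. */
void phase_fine(double *a, long n, int nchunks)
{
    long chunk = (n + nchunks - 1) / nchunks;
    for (int c = 0; c < nchunks; c++) {
        long begin = c * chunk;
        long end   = begin + chunk < n ? begin + chunk : n;
        #pragma omp parallel for schedule(dynamic)
        for (long i = begin; i < end; i++)
            a[i] = a[i] * a[i] + 1.0;
    }
}
```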

20 / 27

SLIDE 22

Parallelism Grain

21 / 27

SLIDE 23

Modified Parallelism Grain in LUB

22 / 27

SLIDE 24

Performance of Modified LUB

23 / 27

SLIDE 25

Rank Distribution — BT-MZ

24 / 27

SLIDE 26

Rank Distribution — Gromacs

25 / 27

SLIDE 27

Rank Distribution — Gadget


26 / 27

SLIDE 28

Conclusion

Summary
  DLB/LeWI can improve performance transparently
  Inter-node load imbalances are not handled
  Granularity of parallelism and process placement are important factors
  The optimal configuration differs with vs. without DLB/LeWI

27 / 27

SLIDE 29

Conclusion

Summary
  DLB/LeWI can improve performance transparently
  Inter-node load imbalances are not handled
  Granularity of parallelism and process placement are important factors
  The optimal configuration differs with vs. without DLB/LeWI
Discussion
  Interaction with MPI
  Benchmarks (1.5 of 3 NPB-MZ, arbitrary load distribution)
  How to find “the right” granularity

27 / 27