SLIDE 1

performance analysis of parallel codes on heterogeneous systems

  • E. Agullo, O. Aumage, B. Bramas, A. Buttari, A. Guermouche, F. Lopez, S. Nakov, S. Thibault

SOLHAR plenary meeting, Bordeaux 25-01-2026

SLIDE 2

a motivating example

SLIDE 3

Plain speedup is not enough

  • qr_mumps + StarPU with 1D, block-column partitioning
  • Matrices from the UF sparse matrix collection

   #  Matrix       Mflops       #  Matrix       Mflops
  12  hirlam       1384160     18  spal_004     30335566
  13  flower_8_4   2851508     19  n4c6-b6      62245957
  14  Rucci1       5671282     20  sls          65607341
  15  ch8-8-b3    10709211     21  TF18        194472820
  16  GL7d24      16467844     22  lp_nug30    221644546
  17  neos2       20170318     23  mk13-b5     259751609

  • One node of the ADA supercomputer (IBM x3750-M4, Intel Sandy Bridge E5-4650 @ 2.7 GHz, 4 × 8 cores)

SLIDE 4

Experimental results: speedups

[Figure: speedup of the 1D variant on 32 cores, per matrix #]

Speedup does say something, e.g., that performance is poor on small matrices and good on larger ones, but it says nothing about the reason. Is the problem in the implementation, in the algorithm, or in the data? And what is going on with that one poorly behaving matrix?


SLIDE 5

performance analysis approach, the homogeneous case

SLIDE 6

Area performance upper bound

Parallel efficiency

The parallel efficiency is defined as

\[ e(p) = \frac{t_{min}(p)}{t(p)} = \frac{\tilde{t}(1)}{p \cdot t(p)} \]

where:

  • t̃(1) is the execution time of the best sequential algorithm on one core;
  • t(p) is the execution time of the best parallel algorithm on p cores.

Note that, in general, t(1) ≥ t̃(1) because:

  • parallelism requires partitioning of data and operations, which reduces the efficiency of tasks;
  • the parallel algorithm may trade some extra flops for concurrency.
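
As a quick worked example with made-up numbers (not taken from the slides): if the best sequential code runs in t̃(1) = 320 s and the parallel run takes t(32) = 16 s on p = 32 cores, then

\[ e(32) = \frac{320}{32 \times 16} = 0.625, \]

i.e., a speedup of 20 out of an ideal 32.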


SLIDES 7-11

A finer performance analysis

The execution time t(p) can be decomposed into the following three terms:

  • t_t(p): the time spent executing tasks.
  • t_r(p): the overhead of the runtime system. t_r(1) := 0.
  • t_i(p): idle time. t_i(1) := 0.

These times are cumulated over all p cores, so that t_t(p) + t_r(p) + t_i(p) = p · t(p). The overall efficiency can thus be written as

\[
e(p) = \frac{\tilde{t}_t(1)}{t_t(p) + t_r(p) + t_i(p)}
     = \underbrace{\frac{\tilde{t}_t(1)}{t_t(1)}}_{e_g} \cdot
       \underbrace{\frac{t_t(1)}{t_t(p)}}_{e_t} \cdot
       \underbrace{\frac{t_t(p)}{t_t(p) + t_r(p)}}_{e_r} \cdot
       \underbrace{\frac{t_t(p) + t_r(p)}{t_t(p) + t_r(p) + t_i(p)}}_{e_p}
\]

with:

  • e_g: the granularity efficiency. Measures the impact of exploiting parallel algorithms compared to sequential ones.
  • e_t: the task efficiency. Measures the exploitation of data locality.
  • e_r: the runtime efficiency. Measures how the runtime overhead affects performance.
  • e_p: the pipeline efficiency. Measures how much concurrency is available and how well it is exploited.
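
The breakdown above is straightforward to compute once the cumulative task, runtime, and idle times have been extracted from an execution trace. A minimal sketch in Python, with variable names of my own choosing (the slides do not prescribe any tool):

```python
def efficiency_breakdown(tt_seq_ref, tt_1, tt_p, tr_p, ti_p):
    """Homogeneous-case efficiency breakdown.

    tt_seq_ref : task time of the best sequential algorithm (t~_t(1))
    tt_1       : cumulative task time of the parallel algorithm on 1 core (t_t(1))
    tt_p, tr_p, ti_p : cumulative task, runtime and idle times on p cores
    """
    e_g = tt_seq_ref / tt_1                      # granularity efficiency
    e_t = tt_1 / tt_p                            # task (locality) efficiency
    e_r = tt_p / (tt_p + tr_p)                   # runtime efficiency
    e_p = (tt_p + tr_p) / (tt_p + tr_p + ti_p)   # pipeline efficiency
    e   = e_g * e_t * e_r * e_p                  # equals tt_seq_ref / (tt_p + tr_p + ti_p)
    return {"e": e, "e_g": e_g, "e_t": e_t, "e_r": e_r, "e_p": e_p}

# Example with made-up numbers (seconds, summed over all workers):
print(efficiency_breakdown(tt_seq_ref=300.0, tt_1=320.0,
                           tt_p=340.0, tr_p=12.0, ti_p=48.0))
```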

SLIDE 12

Experimental results: efficiency breakdown

[Figure: efficiency breakdown of the 1D variant per matrix # -- granularity (e_g), task (e_t), pipeline (e_p) and runtime (e_r) efficiencies]

SLIDE 13

2D partitioning + CA front factorization

1D partitioning is not good for (strongly) overdetermined matrices. Most fronts are overdetermined; the problem is only mitigated by factorizing several fronts concurrently.

  • 2D block partitioning (not necessarily square)
  • Communication avoiding algorithms

This brings:

  • More concurrency
  • More complex dependencies
  • Many more tasks (higher runtime overhead)
  • Finer task granularity (lower kernel efficiency)

Thanks to the simplicity of the STF programming model, it is possible to plug in 2D methods for factorizing the frontal matrices with a relatively moderate effort.
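
To give an order of magnitude for "many more tasks", here is a rough sketch using the standard tiled-QR kernel counts (geqrt/tsqrt/unmqr/tsmqr with a flat reduction tree); this is my own simplification, not necessarily the exact kernel set used in qr_mumps:

```python
def tasks_1d(nb):
    """1D block-column QR: one panel task per block column, plus one
    update task per trailing block column."""
    return sum(1 + (nb - k - 1) for k in range(nb))

def tasks_2d(mb, nb):
    """2D tiled QR, flat reduction tree: per step k, one geqrt,
    (mb-k-1) tsqrt, and for each trailing tile column one unmqr plus
    (mb-k-1) tsmqr, i.e. (mb-k)*(nb-k) tasks in total."""
    return sum((mb - k) * (nb - k) for k in range(min(mb, nb)))

# A strongly overdetermined front, e.g. 40 x 8 blocks:
print(tasks_1d(8), tasks_2d(40, 8))   # 36 vs. 1356 tasks
```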


SLIDE 14

Experimental results: speedups

[Figure: speedup of the 1D and 2D variants on 32 cores, per matrix #]

The scalability of the task-based multifrontal method is enhanced by the introduction of 2D CA algorithms:

  • Speedups are uniform for all tested matrices.
  • We perform a comparative performance analysis with respect to the 1D case to show the benefits of the 2D scheme.


SLIDE 15

Experimental results: efficiency breakdown

[Figure: efficiency breakdown of the 1D and 2D variants per matrix # -- granularity (e_g), task (e_t), pipeline (e_p) and runtime (e_r) efficiencies]

SLIDE 16

case study with scalfmm

SLIDE 17

Uniform - native StarPU (with commute)

[Figure: Taskdep efficiency on miriel with StarPU-C (uniform) -- parallel, task, runtime and pipeline efficiencies vs. number of threads, for test cases from 1M to 100M particles (tree height 7-8)]

SLIDE 18

Uniform - OpenMP-Klang-StarPU (with commute)

[Figure: Taskdep efficiency on miriel with Klang-C (uniform) -- parallel, task, runtime and pipeline efficiencies vs. number of threads, for test cases from 1M to 100M particles (tree height 7-8)]

SLIDE 19

Ellipsoid - native StarPU (with commute)

[Figure: Taskdep efficiency on miriel with StarPU-C (non-uniform) -- parallel, task, runtime and pipeline efficiencies vs. number of threads, for test cases from 1M to 100M particles (tree height 8-11)]

SLIDE 20

Ellipsoid - OpenMP-Klang-StarPU (with commute)

[Figure: Taskdep efficiency on miriel with Klang-C (non-uniform) -- parallel, task, runtime and pipeline efficiencies vs. number of threads, for test cases from 1M to 100M particles (tree height 8-11)]

SLIDE 21

performance analysis approach, the heterogeneous case

SLIDES 22-27

Area performance upper bound

The parallel efficiency can be defined as

\[ e(p) = \frac{t_{min}(p)}{t(p)} \]

where t_min(p) is a lower bound on the execution time on p resources, corresponding to the best schedule under the following assumptions:

  • 1. No runtime overhead and no communications.
  • 2. No task dependencies.
  • 3. Tasks are moldable.

[Figure: Gantt-style illustration of the resulting "area" bound on three processing units PU0, PU1, PU2]

In the heterogeneous case, t_area(p) is the solution of a linear program.

We consider t_area(p) computed as if there were no performance loss resulting from the parallelization, denoted t̃_area(p).
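
As a hedged sketch of how such an area bound could be computed (the slides only state that it is the solution of a linear program; the formulation below, the variable names, and the use of SciPy are my own assumptions): given n_k tasks of each type k, a cost per task of type k on a unit of resource class r, and a number of units per class, minimize the makespan T subject to all work being assigned and no resource class exceeding its capacity.

```python
import numpy as np
from scipy.optimize import linprog

def area_bound(n_tasks, cost, units):
    """n_tasks[k]: number of tasks of type k
       cost[k, r]: time of one type-k task on one unit of resource class r
       units[r]  : number of processing units of class r
       Returns t_area: a lower bound on the makespan (no dependencies,
       no runtime/communication overhead, moldable tasks)."""
    K, R = cost.shape
    nvar = K * R + 1                      # x[k, r] task fractions + makespan T
    obj = np.zeros(nvar); obj[-1] = 1.0   # minimize T

    # Each task type must be fully assigned: sum_r x[k, r] = n_tasks[k]
    A_eq = np.zeros((K, nvar)); b_eq = np.asarray(n_tasks, dtype=float)
    for k in range(K):
        A_eq[k, k * R:(k + 1) * R] = 1.0

    # No class may exceed its capacity: sum_k cost[k, r] * x[k, r] <= units[r] * T
    A_ub = np.zeros((R, nvar)); b_ub = np.zeros(R)
    for r in range(R):
        for k in range(K):
            A_ub[r, k * R + r] = cost[k, r]
        A_ub[r, -1] = -units[r]

    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * nvar, method="highs")
    return res.x[-1]

# Made-up example: 2 task types, 24 CPU cores vs. one GPU.
cost = np.array([[1.0, 0.05],   # type 0: much faster on the GPU
                 [0.2, 0.30]])  # type 1: faster on a CPU core
print(area_bound(n_tasks=[500, 2000], cost=cost, units=[24, 1]))
```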

SLIDES 28-30

A finer performance analysis

The execution time t(p) can be decomposed into the following four terms:

  • t_t(p): the time spent executing tasks.
  • t_r(p): the overhead of the runtime system.
  • t_c(p): the time spent performing communications.
  • t_i(p): idle time.

The overall efficiency can thus be written as

\[
e(p) = \frac{\tilde{t}_{area}(p)}{t(p)}
     = \frac{\tilde{t}_{area}(p) \times p}{t_t(p) + t_r(p) + t_c(p) + t_i(p)}
     = \frac{\tilde{t}^{\,t}_{area}(p)}{t_t(p) + t_r(p) + t_c(p) + t_i(p)}
\]

\[
     = \underbrace{\frac{\tilde{t}^{\,t}_{area}(p)}{t^{\,t}_{area}(p)}}_{e_g} \cdot
       \underbrace{\frac{t^{\,t}_{area}(p)}{t_t(p)}}_{e_t} \cdot
       \underbrace{\frac{t_t(p)}{t_t(p) + t_r(p)}}_{e_r} \cdot
       \underbrace{\frac{t_t(p) + t_r(p)}{t_t(p) + t_r(p) + t_c(p)}}_{e_c} \cdot
       \underbrace{\frac{t_t(p) + t_r(p) + t_c(p)}{t_t(p) + t_r(p) + t_c(p) + t_i(p)}}_{e_p}
\]

where t^t_area(p) = t_area(p) × p denotes the cumulative (over all workers) version of the area bound, with:

  • e_t: the task efficiency. Measures how well the assignment of tasks to processing units matches the tasks' properties to the units' capabilities.
  • e_c: the communication efficiency. Measures the cost of the communications due to data transfers between workers, with respect to the actual work done.
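
A minimal sketch of the heterogeneous breakdown, analogous to the homogeneous one above; the variable names are my own, the area-bound values would come from the linear program of the previous slides and the other quantities from an execution trace:

```python
def hetero_efficiency_breakdown(t_area_ref, t_area, tt, tr, tc, ti, p):
    """Heterogeneous-case efficiency breakdown on p processing units.

    t_area_ref : area bound without parallelization losses (t~_area(p))
    t_area     : area bound of the actual parallel DAG (t_area(p))
    tt, tr, tc, ti : cumulative task, runtime, communication and idle times
    """
    cum_area_ref = t_area_ref * p                   # cumulative versions
    cum_area     = t_area * p                       # of the two bounds
    e_g = cum_area_ref / cum_area                   # granularity efficiency
    e_t = cum_area / tt                             # task efficiency
    e_r = tt / (tt + tr)                            # runtime efficiency
    e_c = (tt + tr) / (tt + tr + tc)                # communication efficiency
    e_p = (tt + tr + tc) / (tt + tr + tc + ti)      # pipeline efficiency
    return e_g * e_t * e_r * e_c * e_p              # = t~_area(p) / t(p)
```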

SLIDE 31

Experimental results: absolute performance

  • qr_mumps + StarPU with hierarchical partitioning and the HeteroPrio++ scheduler
  • One node of the Sirocco computer (Haswell Intel Xeon E5-2680 @ 2.5 GHz, 2 × 12 cores + Nvidia K40)

[Figure: GFlop/s per matrix # on Sirocco -- 1 GPU (coarse), 12 CPUs (fine), 12 CPUs + 1 GPU (hierarchical), 24 CPUs (fine), 24 CPUs + 1 GPU (hierarchical)]

SLIDE 32

Experimental results: efficiency breakdown

[Figure: efficiency breakdown per matrix # with 12 CPUs + 1 GPU and 24 CPUs + 1 GPU -- granularity (e_g), task (e_t), pipeline (e_p) and runtime (e_r) efficiencies]

SLIDE 33

Experimental results: efficiency breakdown

[Figure: communication efficiency (e_c) and overall efficiency (e) per matrix # with 12 CPUs + 1 GPU and 24 CPUs + 1 GPU]

SLIDE 34

critical path analysis

SLIDE 35

Critical path analysis

[Figure: maximum degree of concurrency per matrix #, for 1D and 2D, with and without pipelining]

\[ \text{max speedup} = \text{avg concurrency} = \frac{\sum_{i \in DAG} w_i}{\sum_{i \in CP} w_i} \]

  • The DAG used to conduct this analysis is the one related to the case where 32 working threads are used.
  • The weight w_i of a task is chosen to be equal to its execution time measured in an execution with only one working thread.
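
A minimal sketch of how this bound could be computed from a task DAG, assuming (my assumption, not stated in the slides) that the graph is available as per-task weights plus predecessor lists; the critical path is obtained by a longest-path computation over a topological order:

```python
from collections import defaultdict

def topological_order(predecessors, n):
    """Kahn's algorithm on the predecessor lists."""
    successors = defaultdict(list)
    indeg = [len(predecessors[i]) for i in range(n)]
    for i in range(n):
        for p in predecessors[i]:
            successors[p].append(i)
    ready = [i for i in range(n) if indeg[i] == 0]
    order = []
    while ready:
        i = ready.pop()
        order.append(i)
        for s in successors[i]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

def max_speedup(weights, predecessors):
    """weights[i]: execution time of task i measured with one thread.
       predecessors[i]: tasks that must complete before task i.
       Returns sum(w_i, i in DAG) / sum(w_i, i on the critical path)."""
    order = topological_order(predecessors, len(weights))
    finish = [0.0] * len(weights)
    for i in order:  # longest weighted path ending at each task
        start = max((finish[p] for p in predecessors[i]), default=0.0)
        finish[i] = start + weights[i]
    critical_path_length = max(finish)          # sum of weights on the CP
    return sum(weights) / critical_path_length  # average concurrency
```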


SLIDE 36

Thanks!

Questions?