Performance analysis of parallel codes on heterogeneous systems
E. Agullo, O. Aumage, B. Bramas, A. Buttari, A. Guermouche, F. Lopez, S. Nakov, S. Thibault
SOLHAR plenary meeting, Bordeaux, 25-01-2026

A motivating example
Plain speedup is not enough.
#   Matrix       Mflops
12  hirlam        1384160
13  flower_8_4    2851508
14  Rucci1        5671282
15  ch8-8-b3     10709211
16  GL7d24       16467844
17  neos2        20170318
18  spal_004     30335566
19  n4c6-b6      62245957
20  sls          65607341
21  TF18        194472820
22  lp_nug30    221644546
23  mk13-b5     259751609
Test machine: Sandy Bridge E5-4650 @ 2.7 GHz, 4 × 8 cores.
[Figure: Speedup of the 1D algorithm with 32 cores (x-axis: matrix #, matrices 12–23).]
Speedup does say something, e.g., that performance is poor on small matrices and good on bigger ones. But speedup does not say anything about the reason: is there a problem in the implementation, in the algorithm, or in the data? What's wrong with that crappy matrix?
Parallel efficiency
The parallel efficiency is defined as

e(p) = t_min(p) / t(p) = \tilde{t}(1) / (t(p) · p)

where \tilde{t}(1) is the execution time of the best sequential algorithm on one core. Note that, in general, t(1) ≥ \tilde{t}(1) because, among other reasons, the finer task granularity of the parallel algorithm reduces the efficiency of tasks.
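As a quick illustration of this definition, here is a tiny computation with made-up numbers (not measurements from the experiments in these slides):

```python
# Tiny worked example of the definition above; the numbers are illustrative only.
t_seq_best = 120.0        # tilde t(1): best sequential algorithm on one core
t_par = 5.0               # t(p): measured parallel execution time
p = 32                    # number of cores
speedup = t_seq_best / t_par
e = t_seq_best / (t_par * p)        # e(p) = tilde t(1) / (t(p) * p)
print(f"speedup = {speedup:.1f}, e({p}) = {e:.2f}")   # speedup = 24.0, e(32) = 0.75
```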
The execution time t(p) can be decomposed into the following three terms, cumulated over the p cores: t_t(p), the time spent executing tasks; t_r(p), the time spent in the runtime system; and t_i(p), the idle time.

The overall efficiency can thus be written as:

e(p) = \tilde{t}_t(1) / (t_t(p) + t_r(p) + t_i(p)) = e_g · e_t · e_r · e_p

with:
- e_g = \tilde{t}_t(1) / t_t(1): the granularity efficiency. Measures the impact of exploiting parallel algorithms compared to sequential ones.
- e_t = t_t(1) / t_t(p): the task efficiency. Measures the exploitation of data locality.
- e_r = t_t(p) / (t_t(p) + t_r(p)): the runtime efficiency. Measures how the runtime overhead affects performance.
- e_p = (t_t(p) + t_r(p)) / (t_t(p) + t_r(p) + t_i(p)): the pipeline efficiency. Measures how much concurrency is available and how well it is exploited.
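To make the decomposition concrete, here is a minimal sketch of how the four factors could be computed once the cumulative times are available (e.g., extracted from runtime traces); the function name and all timings are made-up placeholders:

```python
# Sketch: computing the decomposition e(p) = e_g * e_t * e_r * e_p from
# measured cumulative times. All names and numbers are placeholders; in
# practice the times would be extracted from runtime traces.
def efficiency_breakdown(tt_best_seq, tt_1, tt_p, tr_p, ti_p):
    """tt_best_seq : task time of the best sequential algorithm (tilde t_t(1))
       tt_1        : task time of the parallel algorithm run on one core (t_t(1))
       tt_p, tr_p, ti_p : cumulative task / runtime / idle times on p cores."""
    e_g = tt_best_seq / tt_1                       # granularity efficiency
    e_t = tt_1 / tt_p                              # task (locality) efficiency
    e_r = tt_p / (tt_p + tr_p)                     # runtime efficiency
    e_p = (tt_p + tr_p) / (tt_p + tr_p + ti_p)     # pipeline efficiency
    return {"e_g": e_g, "e_t": e_t, "e_r": e_r, "e_p": e_p,
            "e": e_g * e_t * e_r * e_p}

# Example with made-up timings (seconds, cumulated over 32 cores):
print(efficiency_breakdown(tt_best_seq=100.0, tt_1=110.0,
                           tt_p=120.0, tr_p=6.0, ti_p=14.0))
```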
[Figure: Granularity (e_g), task (e_t), pipeline (e_p) and runtime (e_r) efficiencies of the 1D algorithm, matrices 12–23.]
1D partitioning is not good for (strongly) overdetermined matrices:
- Most fronts are overdetermined.
- The problem is mitigated by concurrent front factorizations.

Moving to 2D front factorizations implies:
- More concurrency
- More complex dependencies
- Many more tasks (higher runtime overhead)
- Finer task granularity (less kernel efficiency)

Thanks to the simplicity of the STF programming model, it is possible to plug in 2D methods for factorizing the frontal matrices with relatively moderate effort (see the sketch below).
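As a purely illustrative sketch (not the actual qr_mumps/StarPU code), this is what submitting a 2D tiled front factorization through an STF-style interface could look like; the runtime object, its submit()/wait_all() interface and the kernel names are hypothetical, and dependencies are assumed to be inferred from the declared data accesses:

```python
# Hypothetical STF-style pseudocode: a 2D tiled QR of one front expressed as
# sequential task submissions. MockSTFRuntime only records the tasks; a real
# runtime would schedule them according to the inferred dependencies.
class MockSTFRuntime:
    def __init__(self): self.tasks = []
    def submit(self, kernel, reads=(), writes=()):
        self.tasks.append((kernel, list(reads), list(writes)))
    def wait_all(self): pass

def factorize_front_2d(runtime, A, nt_row, nt_col):
    """Tiled QR of one (overdetermined) front; A[i][j] is tile (i, j)."""
    for k in range(min(nt_row, nt_col)):
        runtime.submit("geqrt", writes=[A[k][k]])               # panel of tile row k
        for j in range(k + 1, nt_col):
            runtime.submit("unmqr", reads=[A[k][k]], writes=[A[k][j]])
        for i in range(k + 1, nt_row):                           # eliminate tiles below the diagonal
            runtime.submit("tpqrt", writes=[A[k][k], A[i][k]])
            for j in range(k + 1, nt_col):
                runtime.submit("tpmqrt", reads=[A[i][k]],
                               writes=[A[k][j], A[i][j]])
    runtime.wait_all()

rt = MockSTFRuntime()
A = [[f"A[{i}][{j}]" for j in range(2)] for i in range(4)]        # 4x2 tiles
factorize_front_2d(rt, A, nt_row=4, nt_col=2)
print(len(rt.tasks), "tasks submitted")
```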
[Figure: Speedup with 32 cores, 1D vs. 2D, matrices 12–23.]
The scalability of the task-based multifrontal method is enhanced by the introduction of 2D CA algorithms; strongly overdetermined matrices are the clearest case to show the benefits of the 2D scheme.
[Figure: Granularity, task, pipeline and runtime efficiencies, 1D vs. 2D, matrices 12–23.]
[Figure: Taskdep efficiency on miriel with StarPU-C (uniform): parallel, task, runtime and pipeline efficiencies vs. number of threads (1–24), for six test cases.]
[Figure: Taskdep efficiency on miriel with Klang-C (uniform): parallel, task, runtime and pipeline efficiencies vs. number of threads (1–24), for six test cases.]
[Figure: Taskdep efficiency on miriel with StarPU-C (non-uniform): parallel, task, runtime and pipeline efficiencies vs. number of threads (1–24), for six test cases.]
[Figure: Taskdep efficiency on miriel with Klang-C (non-uniform): parallel, task, runtime and pipeline efficiencies vs. number of threads (1–24), for six test cases.]
Parallel efficiency on heterogeneous systems

The parallel efficiency can be defined as

e(p) = t_min(p) / t(p)

where t_min(p) is a lower bound on the execution time on p resources, corresponding to the best schedule under idealized assumptions (in particular, neglecting the cost of communications).

[Figure: Gantt-chart illustration of the area bound t_area(p) on three processing units PU0, PU1, PU2.]

In the heterogeneous case, t_area(p) is the solution of a linear program (see the sketch below). We consider t_area(p) in the case where there is no performance loss resulting from the parallelization, denoted \tilde{t}_{area}(p).
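A minimal sketch of how t_area(p) could be obtained as the solution of such a linear program, assuming the tasks can be grouped into a few types with known per-unit execution times; the resource counts, task types and timings below are illustrative placeholders, not data from the experiments:

```python
# Sketch: computing the area bound t_area(p) for a heterogeneous machine as the
# solution of a small linear program. Variables: x[k, r] = fraction of type-k
# tasks run on resource class r, plus T (the makespan lower bound).
import numpy as np
from scipy.optimize import linprog

resources = [("CPU", 24), ("GPU", 1)]                    # (class, number of units)
tasks = {                                                # type: (count, {class: time of one task})
    "panel":  (500,  {"CPU": 2.0e-3, "GPU": 4.0e-3}),
    "update": (8000, {"CPU": 1.5e-3, "GPU": 1.0e-4}),
}

K, R = len(tasks), len(resources)
n_vars = K * R + 1
c = np.zeros(n_vars); c[-1] = 1.0                        # minimize T

# Each task type must be fully assigned: sum_r x[k, r] = 1.
A_eq = np.zeros((K, n_vars)); b_eq = np.ones(K)
for k in range(K):
    A_eq[k, k * R:(k + 1) * R] = 1.0

# Per resource class: (assigned work) / (number of units) <= T.
A_ub = np.zeros((R, n_vars)); b_ub = np.zeros(R)
for r, (rname, p_r) in enumerate(resources):
    for k, (n_k, times) in enumerate(tasks.values()):
        A_ub[r, k * R + r] = n_k * times[rname] / p_r
    A_ub[r, -1] = -1.0                                   # ... - T <= 0

bounds = [(0.0, 1.0)] * (K * R) + [(0.0, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(f"area bound t_area(p) = {res.x[-1]:.4f} s")
```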
The execution time t(p) can be decomposed into the following four terms, cumulated over the p resources: t_t(p), the time spent executing tasks; t_r(p), the time spent in the runtime system; t_c(p), the time spent in communications (data transfers between workers); and t_i(p), the idle time.

The overall efficiency can thus be written as:

e(p) = \tilde{t}_{area}(p) / t(p) = \tilde{t}_{area}(p) · p / (t_t(p) + t_r(p) + t_c(p) + t_i(p)) = \tilde{t}_t^{area}(p) / (t_t(p) + t_r(p) + t_c(p) + t_i(p)) = e_g · e_t · e_r · e_c · e_p

with:
- e_g = \tilde{t}_t^{area}(p) / t_t^{area}(p): the granularity efficiency.
- e_t = t_t^{area}(p) / t_t(p): the task efficiency. Measures how well the assignment of tasks to processing units matches the tasks' properties to the units' capabilities.
- e_r = t_t(p) / (t_t(p) + t_r(p)): the runtime efficiency.
- e_c = (t_t(p) + t_r(p)) / (t_t(p) + t_r(p) + t_c(p)): the communication efficiency. Measures the cost of data transfers between workers with respect to the actual work done.
- e_p = (t_t(p) + t_r(p) + t_c(p)) / (t_t(p) + t_r(p) + t_c(p) + t_i(p)): the pipeline efficiency.

The factors e_g, e_r and e_p keep the same meaning as in the homogeneous case.
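Extending the earlier sketch, the heterogeneous decomposition could be computed as follows; tt_area_best and tt_area stand for \tilde{t}_t^{area}(p) and t_t^{area}(p) (obtained from the area-bound schedule), and all values are placeholders:

```python
# Sketch: heterogeneous decomposition with the extra communication factor e_c.
def hetero_efficiency_breakdown(tt_area_best, tt_area, tt_p, tr_p, tc_p, ti_p):
    e_g = tt_area_best / tt_area
    e_t = tt_area / tt_p
    e_r = tt_p / (tt_p + tr_p)
    e_c = (tt_p + tr_p) / (tt_p + tr_p + tc_p)               # communication efficiency
    e_p = (tt_p + tr_p + tc_p) / (tt_p + tr_p + tc_p + ti_p)
    return e_g * e_t * e_r * e_c * e_p

print(hetero_efficiency_breakdown(90.0, 95.0, 105.0, 5.0, 8.0, 12.0))
```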
HeteroPrio++ scheduler
Test machine (sirocco): 2.5 GHz, 2 × 12 cores + NVIDIA K40 GPU.
[Figure: Performance (GFlop/s) on sirocco with the HeteroPrio++ scheduler, matrices 12–21: 1 GPU (coarse), 12 CPUs (fine), 12 CPUs + 1 GPU (hierarchical), 24 CPUs (fine), 24 CPUs + 1 GPU (hierarchical).]
[Figure: Granularity, task, pipeline and runtime efficiencies, 12 CPUs + 1 GPU vs. 24 CPUs + 1 GPU, matrices 12–21.]
[Figure: Communication efficiency and overall efficiency, 12 CPUs + 1 GPU vs. 24 CPUs + 1 GPU, matrices 12–21.]
[Figure: Maximum degree of concurrency of the 1D and 2D variants, with and without pipelining (nopipe/pipe), matrices 12–23.]
max speedup = avg concurrency = (total task time) / (critical path length), where:
- the task graph is the one generated for the case where 32 working threads are used;
- the task times are measured in an execution with only one working thread.
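A small sketch of how the average concurrency (and hence the maximum achievable speedup) of a task graph can be computed from task durations and dependencies; the tiny DAG below is a made-up example, whereas in practice the graph and the durations would come from a trace of the actual factorization:

```python
# Sketch: average concurrency = (total task time) / (critical path length),
# computed with a topological sweep over a weighted DAG.
from collections import defaultdict, deque

duration = {"A": 2.0, "B": 1.0, "C": 3.0, "D": 1.5, "E": 2.5}   # task -> seconds
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

succs, preds = defaultdict(list), defaultdict(list)
indeg = {t: 0 for t in duration}
for u, v in edges:
    succs[u].append(v); preds[v].append(u); indeg[v] += 1

finish = {}                                   # earliest completion time of each task
queue = deque(t for t, d in indeg.items() if d == 0)
while queue:
    t = queue.popleft()
    finish[t] = duration[t] + max((finish[p] for p in preds[t]), default=0.0)
    for s in succs[t]:
        indeg[s] -= 1
        if indeg[s] == 0:
            queue.append(s)

total_work = sum(duration.values())
critical_path = max(finish.values())
print(f"total task time      : {total_work:.2f} s")
print(f"critical path length : {critical_path:.2f} s")
print(f"average concurrency  : {total_work / critical_path:.2f} (upper bound on speedup)")
```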