
SLIDE 1

Driving Improvements to Algebraic Multigrid Through Performance Modeling

Hormozd Gahvari (1,3), William Gropp (1), Kirk E. Jordan (2), Martin Schulz (3), Ulrike Meier Yang (3)

(1) University of Illinois at Urbana-Champaign

(2) IBM TJ Watson Research Center

(3) Lawrence Livermore National Laboratory

July 6, 2014

SLIDE 2

Algebraic Multigrid

Apply multigrid concept to unstructured grid problems:

Solve original "fine" problem with information from smaller "coarse" problems (Level 0, Level 1, …)

[Figures: unstructured grid hierarchy and multigrid cycle]

Requires two phases:

1. Setup hierarchy of grids
2. Solve problem

LLNL-PRES-528011

SLIDE 3

Performance Issues

AMG scaled well on IBM Blue Gene/L, Blue Gene/P, but has struggled on other machines like Hera, an Opteron cluster at LLNL:

[Figure: AMG Solve Cycle on Hera. Time (s, log scale) vs. Level for 128 Cores, 1024 Cores, and 3456 Cores.]

Results are for a 3D 7-point Laplace model problem, 50 x 50 x 25 points/core

Poor performance on coarse grids hurts scalability:

[Figure: Communication pattern on one of the coarse grids]

Why was there such degradation here but not on Blue Gene machines?

Motivation for developing performance model

SLIDE 4

Performance Model

Approach: work level-by-level, with α-β model ($T_{\text{send}} = \alpha + n\beta$ for message of length n) as baseline

– Fundamental operations at each level (shown in red on the slide): smooth, form residual; restrict to level i+1; prolong to level i-1; smooth
– Treat each operation as MatVec with appropriate operator

Machine parameters for network and computation rate measured using benchmarks

Communication, computation counts are available from solver data structures

LLNL-PRES-656515
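The baseline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Machine` class, the function names, and all numeric values are hypothetical, and charging a MatVec as "local flops plus one α-β send per message" is an assumption about how the counts are combined.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    alpha: float          # message latency in seconds (from a benchmark)
    beta: float           # time per transferred element in seconds
    time_per_flop: float  # seconds per floating-point operation


def send_time(machine, n):
    """Baseline alpha-beta cost of one message of length n: alpha + n * beta."""
    return machine.alpha + n * machine.beta


def matvec_time(machine, flops, message_lengths):
    """Modeled time of one MatVec: local flops plus one send per message."""
    comm = sum(send_time(machine, n) for n in message_lengths)
    return flops * machine.time_per_flop + comm


# Hypothetical parameters, chosen only to exercise the formulas.
hypothetical = Machine(alpha=2.0e-6, beta=1.0e-9, time_per_flop=1.0e-9)
level_time = matvec_time(hypothetical, flops=1_000_000,
                         message_lengths=[500, 500, 250])
```

Summing such MatVec costs over the smooth, residual, restrict, and prolong operations gives a per-level estimate, and summing over levels gives a cycle estimate.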

SLIDE 5

Performance Model

To the baseline models, we add penalties to take architecture into account:

– Distance of communication: introduce time per hop γ
    – Measured from worst-case, best-case latencies and global network diameter
    – Distance of diam(P) charged to each message

– Lower effective bandwidth: multiply β by $\frac{\text{Hardware Bandwidth}}{\text{MPI Bandwidth}}$ or $\frac{\text{Hardware Bandwidth}}{\text{MPI Bandwidth}} + \frac{n_{\text{msgs}}}{n_{\text{links}}}$, depending on machine

– Multicore penalties:
    – c = number of cores per node
    – $P_i$ = number of "active" processes on level i
    – Multicore latency penalty: multiply α by $cP_i/P$
    – Multicore distance penalty: multiply γ by $cP_i/P$

– Hybrid MPI/OpenMP: if using j threads, multiply time per flop by (Mem BW for 1 thread)/(Mem BW for j threads)
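As a rough sketch of how these penalties could modify the baseline send time, consider the Python below. The function names and arguments are hypothetical, the additive form α + diam(P)·γ + nβ is an assumption about how the per-hop charge enters, and the multicore factor is passed in as a plain number rather than derived from c, P_i, and P.

```python
def penalized_send_time(n, alpha, beta, gamma, diameter,
                        hw_bandwidth, mpi_bandwidth,
                        n_msgs=0, n_links=1, multicore_factor=1.0):
    """Send time for a message of length n with architecture penalties applied."""
    # Bandwidth penalty: hardware/MPI bandwidth ratio, plus an optional
    # n_msgs/n_links term on machines where limited links matter.
    beta_eff = beta * (hw_bandwidth / mpi_bandwidth + n_msgs / n_links)
    # Multicore penalties scale both the latency and the per-hop time.
    alpha_eff = alpha * multicore_factor
    gamma_eff = gamma * multicore_factor
    # Every message is charged the full network diameter in hops.
    return alpha_eff + diameter * gamma_eff + n * beta_eff


def hybrid_time_per_flop(time_per_flop, mem_bw_one_thread, mem_bw_j_threads):
    """Scale time per flop by (Mem BW for 1 thread) / (Mem BW for j threads)."""
    return time_per_flop * mem_bw_one_thread / mem_bw_j_threads


# Hypothetical values, for illustration only.
t_ratio_only = penalized_send_time(1000, alpha=1e-6, beta=1e-9, gamma=1e-7,
                                   diameter=10, hw_bandwidth=2.0,
                                   mpi_bandwidth=1.0)
t_with_links = penalized_send_time(1000, alpha=1e-6, beta=1e-9, gamma=1e-7,
                                   diameter=10, hw_bandwidth=2.0,
                                   mpi_bandwidth=1.0, n_msgs=4, n_links=2)
```

Leaving `n_msgs` at 0 selects the first form of the bandwidth penalty; passing message and link counts selects the second.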

SLIDE 6

Performance Model

[Figures: Cycle Time by Level on Hera, 1024 Processes, and Cycle Time by Level on Intrepid, 8192 Processes. Time (s, log scale) vs. Level. Curves: α-β Model, α-β-γ Model, β Penalty, α,β Penalties, β,γ Penalties, α,β,γ Penalties.]

Intrepid (Blue Gene/P at Argonne)

– Fine grid dominates performance
– Only β penalty applies
– α-β model close to actual

vs.

Hera (Opteron Cluster at LLNL)

– Coarse grids dominate performance
– All penalties apply
– α-β model much different from actual

SLIDE 7

Observations

The issue is not the communication itself but the ability of the interconnect to handle it

Trend towards more on-node parallelism means we cannot rely on interconnects

– Hera = worst case scenario
– However, even something between current-generation machines and Hera would be very problematic

Model gives us a way forward. We will show how to use it to

1. Guide data redistribution that trades communication for computation
2. Guide thread/task mix selection in hybrid MPI/OpenMP

SLIDE 8

Data Redistribution in AMG

Idea that has gained traction:

– Concentrate data on coarse grids so that fewer messages are sent
– Has been done with or without redundant replication

Illustration of redistribution strategy:

– Split problem domain into chunks (blue boxes)
– Processes within a chunk have same part of domain
– Redundant version shown with 12 processes and 4 chunks
– Nonredundant version would keep just one color group

Performance model can be adjusted to model this:

– At level where redistribution is performed, charge for needed collective operations
– On this and coarser levels, communication is with at most C-1 partners. Adjust computation based on amount of data concentrated
– Adjust time per flop based on "problem size classification" (parts of data that fit in cache)

SLIDE 9

Guiding Data Redistribution

At each coarse grid in setup phase, use model to estimate:

1. Time spent at that level in solve cycle when redistributing ($T_{\text{switch}}$)
2. And when not redistributing ($T_{\text{noswitch}}$)
3. If $T_{\text{switch}} < T_{\text{noswitch}}$, then redistribute

Requires extra information:

– Interpolation operator unavailable: substitute MatVec with solve operator
– Time per flop unknown: measure with MatVecs using local portion of parallel data

Other concerns:

– Time per flop changes after redistribution: do not change it in model, but prevent redistribution if problem size classification increases
– Hybrid MPI/OpenMP use? Requires essentially no change! Implicit in measurement of time per flop
– Number of chunks to carve problem into? For quick setup, search powers of 2 <= max # sends
– Possible overeager switching: keep track of running estimated cycle time, do not switch if overall modeled improvement is < 5%
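The decision rule at one coarse level can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the solver's code: `t_switch_for` is a hypothetical callback that returns the modeled T_switch for a given chunk count, the search starts at one chunk, and the 5% rule is read here as "modeled saving divided by the running modeled cycle time".

```python
def candidate_chunk_counts(max_sends):
    """Powers of 2 that do not exceed the maximum number of sends."""
    counts, c = [], 1
    while c <= max_sends:
        counts.append(c)
        c *= 2
    return counts


def choose_redistribution(t_noswitch, t_switch_for, max_sends,
                          cycle_time_so_far, min_improvement=0.05):
    """Return the chunk count to redistribute to at this level, or None.

    t_noswitch: modeled time at this level without redistribution.
    t_switch_for: hypothetical callable, chunk count -> modeled T_switch.
    cycle_time_so_far: running modeled cycle time of the levels above.
    """
    best = min(candidate_chunk_counts(max_sends), key=t_switch_for,
               default=None)
    if best is None:
        return None
    t_switch = t_switch_for(best)
    if t_switch >= t_noswitch:
        return None  # redistributing is not modeled to be faster
    # Guard against overeager switching: require the saving to be at least
    # min_improvement of the overall modeled cycle time.
    overall = cycle_time_so_far + t_noswitch
    if (t_noswitch - t_switch) / overall < min_improvement:
        return None
    return best


# Hypothetical modeled times for chunk counts 1, 2, and 4.
modeled = {1: 0.9, 2: 0.5, 4: 0.7}
choice = choose_redistribution(1.0, modeled.__getitem__, max_sends=5,
                               cycle_time_so_far=1.0)
```

With these made-up numbers the saving at two chunks is large relative to the cycle, so the sketch chooses to redistribute; with a much larger running cycle time the same saving would fall under the threshold and the level would be left alone.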

SLIDE 10

Redistribution Experiments

Pair of test problems:

1. 3D Laplace with 30 x 30 x 30 points/core
2. Linear Elasticity with ~6,300 points/core

Used nonredundant redistribution owing to issues with large numbers of MPI communicators at scale

Ran on three machines:

– Vulcan: IBM Blue Gene/Q at LLNL
– Titan: Cray XK7 at ORNL
– Eos: Cray XC30 at ORNL

SLIDE 11

Laplace Results

[Figure: AMG on Vulcan, 3D Laplace, 16 x 4 MPI/OpenMP Mix. Runtime (s) vs. No. Cores, split into Setup and Solve. Bar labels: 512 cores: 0.95, 0.76, 0.89, 0.74; 4096 cores: 1.46, 0.97, 1.11, 0.97; 32768 cores: 2.30, 1.29, 1.88, 1.28.]

Speedups, Vulcan, 16 x 4 Mix:

Cores    Setup   Solve   Overall
512      1.03    1.07    1.05
4096     1.32    1.00    1.17
32768    1.22    1.01    1.14

[Figure: AMG on Titan, 3D Laplace, 8 x 2 MPI/OpenMP Mix. Runtime (s) vs. No. Cores, split into Setup and Solve. Bar labels: 512 cores: 0.48, 0.51, 0.45, 0.52; 4096 cores: 0.88, 0.94, 0.67, 0.80; 32768 cores: 2.57, 1.70, 1.69, 1.27.]

Speedups, Titan, 8 x 2 Mix:

Cores    Setup   Solve   Overall
512      1.07    0.98    1.02
4096     1.18    1.31    1.24
32768    1.52    1.34    1.44

[Figure: AMG on Eos, 3D Laplace, 16 x 1 MPI/OpenMP Mix. Runtime (s) vs. No. Cores, split into Setup and Solve. Bar labels: 512 cores: 0.36, 0.19, 0.24, 0.18; 4096 cores: 1.05, 0.26, 0.66, 0.24; 8000 cores: 1.30, 0.30, 0.73, 0.28.]

Speedups, Eos, 16 x 1 Mix:

Cores    Setup   Solve   Overall
512      1.50    1.06    1.31
4096     1.59    1.08    1.46
8000     1.78    1.07    1.58
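The speedup arithmetic can be checked with a short Python sketch. It assumes, and the slide does not state this, that the four numbers printed on each group of bars in the Eos figure are ordered as setup and solve time without redistribution followed by setup and solve time with redistribution. The function and variable names are illustrative.

```python
def speedups(setup_before, solve_before, setup_after, solve_after):
    """Setup, solve, and overall speedup, rounded to two decimals."""
    setup = setup_before / setup_after
    solve = solve_before / solve_after
    overall = (setup_before + solve_before) / (setup_after + solve_after)
    return round(setup, 2), round(solve, 2), round(overall, 2)


# Runtimes (s) read from the Eos figure, in the assumed order described above.
eos_bar_labels = {
    512: (0.36, 0.19, 0.24, 0.18),
    4096: (1.05, 0.26, 0.66, 0.24),
    8000: (1.30, 0.30, 0.73, 0.28),
}

eos_speedups = {cores: speedups(*times)
                for cores, times in eos_bar_labels.items()}
```

Under that reading, the computed values agree with the Eos speedup rows, with overall speedup taken as the ratio of total (setup plus solve) runtimes.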

SLIDE 12

Linear Elasticity Results

[Figures: results on Eos (all MPI), Vulcan (16 x 4 Mix), and Titan (all MPI)]

SLIDE 13

MPI/OpenMP Mix Suggestion

In our results, we have seen that the MPI/OpenMP mix has a large impact on performance

Range of possibilities will increase on future machines

Can we help users select this mix?

For inspiration, turn to a bounding approach:

– Allow amount of computation and communication in AMG cycle to vary
– Determine OpenMP "improvement regions" in parameter space

SLIDE 14

MPI/OpenMP Mix Suggestion

The procedure:

– Assume a cycle run MPI-only has a certain % of time devoted to computation
– Use hybrid MPI/OpenMP penalties to calculate for each MPI/OpenMP mix how much communication reduction is necessary for performance improvement

The math:

– Assume AMG cycle takes 100 s. Split cycle time into components:

  $T_{\text{cycle}} = 100 = T_{\text{comp}} + T_{\text{comm}}$

– Let $f_{\text{comm}}$ be fraction of communication time needed for improvement from OpenMP, and $p_{\text{omp}}$ be the OpenMP penalty from model. Want the region

  $p_{\text{omp}} T_{\text{comp}} + f_{\text{comm}} T_{\text{comm}} \le 100$

– Visualize by plotting

  $f_{\text{comm}} \le (100 - p_{\text{omp}} T_{\text{comp}}) / (100 - T_{\text{comp}})$
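The bound is easy to evaluate numerically. The Python below is a small sketch of the inequality above, with a hypothetical function name and made-up penalty values; it uses the slide's 100 s cycle, so the computation time in seconds equals the percentage of the cycle spent in computation.

```python
def comm_fraction_bound(comp_percent, p_omp):
    """Largest f_comm for which p_omp*T_comp + f_comm*T_comm <= 100.

    comp_percent: percent of the MPI-only cycle spent in computation,
    equal to T_comp in seconds for the assumed 100 s cycle.
    p_omp: OpenMP penalty on computation time for a given mix.
    """
    t_comp = float(comp_percent)
    t_comm = 100.0 - t_comp
    if t_comm <= 0.0:
        raise ValueError("cycle must contain some communication time")
    return (100.0 - p_omp * t_comp) / t_comm


# Hypothetical penalty of 1.2 at several computation percentages.
bounds = {pct: comm_fraction_bound(pct, 1.2) for pct in (20, 50, 80)}
```

A mix can only help if OpenMP cuts communication time to at most the returned fraction. A bound of zero or less means no amount of communication reduction compensates for the computation penalty at that point in parameter space.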

SLIDE 15

MPI/OpenMP Mix Suggestion

[Figures: OpenMP Improvement Regions for Vulcan (2, 4, 8, 16, 32, 64 OMP), Titan (2, 4, 8, 16 OMP), and Eos (2, 4, 8, 16 OMP). x-axis: Percent of Cycle Time Spent in Computation; y-axis: Communication Reduction from OpenMP.]

Vulcan

3D Laplace Solve Times:

Cores    32 x 2    16 x 4    8 x 8
512      0.86 s    0.76 s    0.75 s
4096     1.07 s    0.97 s    1.00 s
32768    1.42 s    1.29 s    1.30 s

Biggest regions (8x8, 16x4) have best times!

Titan

3D Laplace Solve Times:

Cores    16 x 1    8 x 2     4 x 4
512      0.48 s    0.51 s    0.65 s
4096     0.84 s    0.94 s    0.97 s
32768    2.05 s    1.70 s    2.41 s

Biggest region is blank, suggesting all-MPI; however, 8x2 is the best at large scale

Eos

3D Laplace Solve Times:

Cores    16 x 1    8 x 2     4 x 4
512      0.19 s    0.19 s    0.22 s
4096     0.28 s    0.29 s    0.29 s
8000     0.30 s    0.32 s    0.39 s

Biggest region is blank, suggesting all-MPI, which is the best!

SLIDE 16

Conclusions

Performance modeling has enabled us to

– Understand performance issues faced by AMG
– Understand factors behind performance on different machines
– Guide data redistribution at runtime that improves performance
    – Decision that would otherwise require guesswork becomes automated
    – Yields improvements even on modern machines when using hybrid programming to reduce number of messages sent
– Guide thread/task mix selection when using hybrid programming model
    – Long way to go here
    – We can, however, use information to avoid choices that are obviously bad

SLIDE 17

Future Work Directions

Additional changes to AMG

– Refine data redistribution scheme, test on more problems
– Search for other places in AMG where we could trade one cost for another to get a performance benefit

Additional targets for performance modeling

– AMG setup phase
– Refine thread/task prediction framework
– Predictions on future machines

Other applications

– Performance models are MatVec based; can examine other applications based on them
– Thread/task mix prediction
– Find tunable parameters like we did for AMG and use performance models to turn the knobs

SLIDE 18

Questions?

SLIDE 19

Acknowledgements

Lawrence Livermore National Laboratory is operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration under Contract DE-AC52-07NA27344. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract DE-AC02-06CH11357, and resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract DE-AC05-00OR22725. An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.
