


This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

Charm++ Workshop ◆ April 30, 2014

Task mapping, job placements and routing strategies

Abhinav Bhatele
Center for Applied Scientific Computing

LLNL: Peer-Timo Bremer, Todd Gamblin, Katherine E. Isaacs, Steven H. Langer, Kathryn Mohror, Martin Schulz
Illinois: Ronak Buch, Nikhil Jain, Harshitha Menon, Laxmikant V. Kale, Michael Robson
Utah: Amey Desai, Aaditya G. Landge, Valerio Pascucci
Purdue: Ahmed Abdel-Gawad, Mithuna Thottethodi
LBL: Brian Austin, Nicholas J. Wright


LLNL-PRES-654602 Abhinav Bhatele @ Charm++ Workshop

Communication: the bottleneck at extreme scale

  • High costs for data movement in terms of time and energy
  • Newer platforms stress communication further (more cores, bigger networks)
  • Imperative to minimize data movement and maximize locality

                                Time (ns)   Energy spent (pJ)
  Floating point operation      < 0.25      30-45
  Access to DRAM                50          128
  Get data from another node    > 1000      128-576

  Network bytes-to-flop ratios:

  IBM                       Cray
  Blue Gene/L   0.375       XT3   8.77
  Blue Gene/P   0.375       XT4   1.36
  Blue Gene/Q   0.117       XT5   0.23

  P. Kogge et al., Exascale computing study: Technology challenges in achieving exascale systems, Technical Report, 2008.
  A. Bhatele et al., Automated mapping of regular communication graphs on mesh interconnects, Intl. Conf. on High Performance Computing (HiPC), 2010.

TASK MAPPING

Topology aware task mapping

  • What is mapping: the layout/placement of an application's tasks/processes on the physical interconnect
  • Does not require any changes to the application
  • Goals:
    • Balance computational load
    • Minimize contention (optimize latency or bandwidth)
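The contention goal is usually scored offline by a hop-based cost. A minimal sketch of such a score (the helper names and the toy task graph are made up for illustration; this is not Rubik or Charm++ code):

```python
# Score a task-to-processor mapping by total weighted hops on a 2D mesh.
# Hypothetical illustration of hop-based mapping metrics.

def manhattan(a, b):
    """Hop distance between two mesh coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_hops(comm_graph, mapping):
    """comm_graph: {(task_u, task_v): bytes}; mapping: task -> (x, y)."""
    return sum(b * manhattan(mapping[u], mapping[v])
               for (u, v), b in comm_graph.items())

# Four tasks exchanging 1 MB messages in a ring, placed as a 2x2 block
graph = {(0, 1): 1e6, (1, 3): 1e6, (3, 2): 1e6, (2, 0): 1e6}
block = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
print(weighted_hops(graph, block))  # 4000000.0: every neighbor is 1 hop away
```

A mapping that scatters the same four tasks across the mesh would score higher, which is the intuition behind bringing communicating tasks closer.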
Maximize bandwidth?

  • Traditionally, research has focused on bringing tasks closer together to reduce the number of hops
  • This minimizes latency and, more importantly, link contention
  • For applications that send large messages, this might not be optimal

  (Figure: illustrated for 1D, 2D, 3D and 4D cases.)


Rubik

  • We have developed a mapping tool focusing on:
    • structured applications that are bandwidth-bound and use collectives over sub-communicators
    • built-in operations that can increase effective bandwidth on torus networks based on heuristics
  • Input:
    • application topology with subsets identified
    • processor topology
    • set of operations to perform
  • Output: map file for the job launcher
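The partition-tree idea can be sketched in plain Python: tiling groups the ranks of a box into equal sub-boxes. This mimics the semantics of Rubik's tile() but is not Rubik itself:

```python
# Mimic Rubik-style tiling: split a box of ranks into equal tiles.
# Illustrative only; the real Rubik API is richer.
from itertools import product

def tile(shape, tile_shape):
    """Group the coordinates of a box `shape` into tiles of `tile_shape`.

    Returns {tile_index: [coordinates, ...]}.
    """
    tiles = {}
    for coord in product(*(range(s) for s in shape)):
        key = tuple(c // t for c, t in zip(coord, tile_shape))
        tiles.setdefault(key, []).append(coord)
    return tiles

# A 6x6x6 network split into 3x3x3 cubes -> eight tiles of 27 processors
cubes = tile((6, 6, 6), (3, 3, 3))
print(len(cubes))             # 8
print(len(cubes[(0, 0, 0)]))  # 27
```

Mapping then pairs each application tile with a network tile, as the example on the next slide does.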

Application example

  A Rubik script that maps the 27-task planes of a 9x3x8 application grid onto 27-processor cubes of a 6x6x6 torus:

    app = box([9, 3, 8])       # create app partition tree of 27-task planes
    app.tile([9, 3, 1])
    network = box([6, 6, 6])   # create network partition tree of 27-processor cubes
    network.tile([3, 3, 3])
    network.map(app)           # map task planes into cubes

  (Figure: the 216-task app and the 216-processor network, each split into eight 27-element tiles, with application ranks mapped into the network.)

Mapping pF3D

  • A laser-plasma interaction code used at the National Ignition Facility (NIF) at LLNL
  • Three communication phases over a 3D virtual topology:
    • Wave propagation and coupling: 2D FFTs within XY planes
    • Light advection: send-recv between consecutive XY planes
    • Hydrodynamic equations: 3D near-neighbor exchange

  Time spent in MPI calls:

                2048 cores            16384 cores
  MPI call      Total %   MPI %       Total %   MPI %
  Send          4.90      28.45       23.10     57.21
  Alltoall      8.10      46.94       7.30      18.07
  Barrier       2.78      16.10       8.13      20.15

Performance benefits

  (Figure: comparison of different mappings — TXYZ, XYZT, tile, tiltX, tiltXY — on 2,048 cores; time in seconds split into Barrier, All-to-all, Send and Receive.)

  (Figure: execution time per iteration for the default and best mappings of pF3D from 2,048 to 65,536 cores; a 60% difference is marked.)

  A. Bhatele et al., Mapping applications with collectives over sub-communicators on torus networks, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '12, IEEE Computer Society, November 2012.

Visualizing network traffic using Boxfish

  (Figure: Boxfish renderings of per-link traffic (2 MB to 76 MB) along the X, Y and Z torus directions for the TXYZ, XYZT, tile, tiltX and tiltXY mappings.)

MODELING & SIMULATION
slide-25
SLIDE 25

LLNL-PRES-654602 Abhinav Bhatele @ Charm++ Workshop

Predicting execution time without executing the code

  • Goal: find which mapping gives the best performance
  • Offline metrics: maximum hops, average bytes,

maximum bytes

  • Use network hardware counters to propose new

metrics

  • Supervised learning algorithms to predict

performance

12

  • N. Jain et al. Predicting application performance using supervised learning on communication features. In Proceedings of the ACM/IEEE

International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13. IEEE Computer Society, November 2013.

Why don't we run all the mappings?

  • Wasted allocation hours
  • Wasted time in the queue
  • All we need to know is: which is the best mapping?

              2012     2013
  Intrepid    4.16M    0.73M
  Mira        0.17M    7.67M
  Total       4.33M    8.40M

  13 million core hours!

Supervised learning: scikit-learn

  • Use simulation and other tools to obtain network counters and other contention parameters
  • Exploit supervised learning algorithms for performance prediction:
    • forests of randomized decision trees

  (Figure: part of a randomized decision tree splitting on two features, e.g. X[1] <= 0.4295 and X[0] <= 0.0082, down to leaves.)

  http://scikit-learn.org
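A minimal sketch of this approach with scikit-learn. The features and timings below are synthetic stand-ins for illustration, not the actual SC '13 training sets: each row holds per-mapping communication features, and a forest of randomized decision trees predicts execution time for held-out mappings.

```python
# Sketch: predict execution time from communication features with a
# forest of randomized decision trees (illustrative data only).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# One row per mapping: [avg bytes per link, max bytes on a link,
# avg buffer length, max FIFO length] -- hypothetical feature set
X = rng.random((200, 4))
# Synthetic "execution time", dominated by the max-bytes feature
y = 5.0 * X[:, 1] + 0.5 * X[:, 0] + 0.1 * rng.random(200)

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X[:150], y[:150])      # train on 150 mappings
pred = model.predict(X[150:])    # predict the remaining 50

print(round(model.score(X[150:], y[150:]), 2))  # R^2 on held-out mappings
```

The same R² score is what the Results slides report when comparing individual metrics and hybrid metrics as features.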

Existing and new metrics

  • Existing metrics:
    • maximum hops (dilation)
    • average bytes per link
    • maximum bytes on a link
  • New metrics:
    • buffer length (on intermediate nodes)
    • FIFO length (packets in injection FIFOs)
    • delay per link (packets in buffers / #received packets)

  (Figure: time per iteration plotted against maximum dilation, average bytes per link, and maximum bytes on a link.)
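These metrics are straightforward to compute from per-link counters. A small illustration (the link names and counter values are hypothetical):

```python
# Compute average bytes, maximum bytes, and delay per link from
# hypothetical per-link counters (illustrative values only).

# bytes sent over each link of a toy network
link_bytes = {"A->B": 8e8, "B->C": 1.6e9, "C->D": 2e9, "D->A": 8e8}
# packets waiting in buffers and packets received, per link
buffered = {"A->B": 10, "B->C": 120, "C->D": 300, "D->A": 5}
received = {"A->B": 1e5, "B->C": 2e5, "C->D": 2.5e5, "D->A": 1e5}

avg_bytes = sum(link_bytes.values()) / len(link_bytes)
max_bytes = max(link_bytes.values())
# delay per link = packets in buffers / number of received packets
delay = {l: buffered[l] / received[l] for l in link_bytes}

print(avg_bytes)                  # 1300000000.0
print(max_bytes)                  # 2000000000.0
print(max(delay, key=delay.get))  # C->D, the most congested link
```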

Message life cycle on Blue Gene/Q

  (Figure: contention points along a message's life cycle — memory contention, injection FIFO contention, link contention, receive buffer contention, reception FIFO contention, and memory contention again at the receiver.)

Results

  • Three communication kernels:
    • five-point 2D stencil
    • 14-point 3D stencil
    • all-to-all over sub-communicators

  (Figure: R² of absolute performance correlation for max dilation, avg bytes and max bytes, and for hybrid metrics H1-H6, across message sizes for Sub A2A, 3D Halo and 2D Halo.)

Performance prediction for communication kernels

  • Better correlation than with existing metrics such as average or maximum bytes
  • Hybrid metric: average bytes + maximum bytes + average buffer length + maximum FIFO length
  • Crazy things:
    • combine all training sets
    • use the 16k training set to predict 64k performance

  (Figure: observed vs. predicted execution times for 30 mappings of Sub A2A, 3D Halo and 2D Halo on Blue Gene/Q, 16,384 cores, sorted by actual execution time.)

Predicting the performance of pF3D

  • Production application:
    • has computation
    • and multiple phases of communication
  • Hybrid metric: average bytes + average buffer length + average delay + sum of hops + maximum FIFO length

  (Figure: observed vs. predicted execution times for 30 mappings of pF3D on Blue Gene/Q, 16,384 cores, sorted by actual execution time.)

JOB PLACEMENT & ROUTING

Performance variability

  (Figure: average messaging rates — total bytes sent on the network divided by the time spent sending the messages — for batch jobs running a laser-plasma interaction code.)

Leads to several problems ...

  • Individual jobs run slower:
    • More time to complete science simulations
    • Increased wait time in job queues
    • Inefficient use of machine time allocation/core-hours
  • Overall lower throughput
  • Increased energy usage/costs

Also affects software development

  • Debugging performance issues
  • Quantifying the effect of various software changes on performance:
    • code changes
    • compiler/software stack changes
  • Requesting time for a batch job
  • Writing allocation proposals

pF3D characterization

  (Figure: time spent in communication and computation in pF3D on Hopper, Intrepid and Mira.)

  (Figure: time spent in MPI calls — Alltoall, Barrier, Send, Recv, Probe — on 512 nodes of Hopper, Intrepid and Mira.)

Sources of variability

  • Operating system noise (OS jitter): OS daemons running on some cores of each node
  • Placement/location of the allocated nodes for the job (allocation shape)
  • Contention for shared resources (inter-job contention): sharing network links with other jobs

4x8x8-shaped pF3D job

  (Figure: pF3D job placements on April 11 and April 16, with a MILC job shown in green; 25% higher messaging rate.)

  (Figure: pF3D job placements on April 11 and April 16b, with the MILC job shown in green; 27.8% higher messaging rate. LSMS is not communication-heavy.)

  https://scalability.llnl.gov/performance-analysis-through-visualization/software.php

Slowest vs. fastest job

  (Figure: placements of the slowest and fastest pF3D jobs, March 15 and April 04; three conflicting jobs, two of them MILC. 2.29x higher messaging rate.)

Effect of MILC on pF3D

  (Figure: histogram of total messaging rates for pF3D runs with and without a neighboring MILC job. With MILC: avg = 58 MB/s, σ = 9.12 MB/s; without MILC: avg = 66 MB/s, σ = 8.69 MB/s.)

Performance tip!

  • Variability is insignificant on IBM Blue Gene systems
  • OS noise and allocation shape have a weak correlation with performance
  • The placement of other jobs around a job can affect its performance significantly

  http://www.hpcwire.com/2013/11/16/sc13-research-highlight-goes-performance-neighborhood/

Modeling job placements and message routing

  • Dragonfly topology: a two-level hierarchical topology
  • Routing choices: static (deterministic) vs. dynamic (adaptive); direct vs. indirect (random jumps)
  • Placement options: random, round-robin, blocked

  (Figure: the dragonfly topology — a dragonfly router with processor and network ports; a group of 96 routers with all-to-all networks within rows and columns at level 1; and a level-2 all-to-all network connecting groups, not all groups or links shown.)
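The three placement options can be sketched as simple policies assigning a job's nodes to dragonfly groups. The group count and job size below are made up, and this is not the simulator used in the study:

```python
# Sketch of blocked, round-robin, and random job placement over
# dragonfly groups (illustrative only).
import random

GROUPS = 8           # number of dragonfly groups (hypothetical)
NODES_PER_GROUP = 4  # nodes per group (hypothetical)

def blocked(job_size):
    """Fill groups one after another."""
    return [n // NODES_PER_GROUP for n in range(job_size)]

def round_robin(job_size):
    """Spread consecutive nodes across groups."""
    return [n % GROUPS for n in range(job_size)]

def randomized(job_size, seed=0):
    """Pick a random group for every node."""
    rng = random.Random(seed)
    return [rng.randrange(GROUPS) for _ in range(job_size)]

print(blocked(8))      # [0, 0, 0, 0, 1, 1, 1, 1]
print(round_robin(8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Blocked placement keeps a job's traffic inside few groups (stressing local links), while round-robin and random placements spread it over the level-2 network, which is why placement and routing have to be studied together.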

Single jobs

  • All-to-all over sub-communicators
  • Various traffic metrics

  (Figure: link usage in MB over all links for a many-to-many pattern, under placements RDN, RDR, RDC, RDG, RRN and RRR combined with static direct, adaptive direct, static indirect, adaptive indirect and adaptive hybrid routing; box plots show minimum, quartiles, median, average and maximum, with the lowest maximum highlighted.)

Parallel job workload

  • Representative of NERSC workloads
  • Static routing is out of the question
  • Routings with indirect jumps are preferred

  (Figure: link usage in MB for Workloads 2 and 4 under adaptive direct, adaptive indirect and adaptive hybrid routing with placements RDN, RDR, RDC, RDG, RRN and RRR.)

  N. Jain et al., Maximizing network throughput on the dragonfly interconnect, in submission to the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, 2014.

Summary

  • Optimizing communication is the #1 priority:
    • Minimize off-node communication
    • Map the remaining off-node communication carefully
  • Job placements and mapping are non-intrusive methods for improving performance
  • Going forward, modeling and simulation will be crucial for:
    • designing future networks
    • predicting application performance

http://computation-rnd.llnl.gov/extreme-computing/interconnection-networks.php

This work was funded by the Laboratory Directed Research and Development Program at LLNL under project tracking code 13-ERD-055: STATE - Scalable Topology Aware Task Embedding.