


This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

Charm++ Workshop ◆ April 30, 2014

Task mapping, job placements and routing strategies

Abhinav Bhatele
Center for Applied Scientific Computing

LLNL: Peer-Timo Bremer, Todd Gamblin, Katherine E. Isaacs, Steven H. Langer, Kathryn Mohror, Martin Schulz
Illinois: Ronak Buch, Nikhil Jain, Harshitha Menon, Laxmikant V. Kale, Michael Robson
Utah: Amey Desai, Aaditya G. Landge, Valerio Pascucci
Purdue: Ahmed Abdel-Gawad, Mithuna Thottethodi
LBL: Brian Austin, Nicholas J. Wright


LLNL-PRES-654602 Abhinav Bhatele @ Charm++ Workshop

Communication: the bottleneck at extreme scale

  • High costs for data movement in terms of time and energy
  • Newer platforms stress communication further (more cores, bigger networks)
  • Imperative to minimize data movement and maximize locality

                                Time (ns)   Energy spent (pJ)
  Floating point operation      < 0.25      30-45
  Access to DRAM                50          128
  Get data from another node    > 1000      128-576

  Network bytes-to-flop ratios:

  IBM                       Cray
  Blue Gene/L   0.375       XT3   8.77
  Blue Gene/P   0.375       XT4   1.36
  Blue Gene/Q   0.117       XT5   0.23

  P. Kogge et al., Exascale computing study: Technology challenges in achieving exascale systems, Technical Report, 2008.
  A. Bhatele et al., Automated mapping of regular communication graphs on mesh interconnects, Intl. Conf. on High Performance Computing (HiPC), 2010.

TASK MAPPING

Topology aware task mapping

  • What is mapping: the layout/placement of an application's tasks/processes on the physical interconnect
  • Does not require any changes to the application
  • Goals:
    • Balance computational load
    • Minimize contention (optimize latency or bandwidth)
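The contention goal is usually scored offline by a hop-based cost. A minimal sketch of such a score (the helper names and the toy task graph are made up for illustration; this is not Rubik or Charm++ code):

```python
# Score a task-to-processor mapping by total weighted hops on a 2D mesh.
# Hypothetical illustration of hop-based mapping metrics.

def manhattan(a, b):
    """Hop distance between two mesh coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_hops(comm_graph, mapping):
    """comm_graph: {(task_u, task_v): bytes}; mapping: task -> (x, y)."""
    return sum(b * manhattan(mapping[u], mapping[v])
               for (u, v), b in comm_graph.items())

# Four tasks exchanging 1 MB messages in a ring, placed as a 2x2 block
graph = {(0, 1): 1e6, (1, 3): 1e6, (3, 2): 1e6, (2, 0): 1e6}
block = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
print(weighted_hops(graph, block))  # 4000000.0: every neighbor is 1 hop away
```

A mapping that scatters the same four tasks across the mesh would score higher, which is the intuition behind bringing communicating tasks closer.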
Maximize bandwidth?

  • Traditionally, research has focused on bringing tasks closer together to reduce the number of hops
  • This minimizes latency and, more importantly, link contention
  • For applications that send large messages, this might not be optimal

  (Figure: illustrated for 1D, 2D, 3D and 4D cases.)


Rubik

  • We have developed a mapping tool focusing on:
    • structured applications that are bandwidth-bound and use collectives over sub-communicators
    • built-in operations that can increase effective bandwidth on torus networks based on heuristics
  • Input:
    • application topology with subsets identified
    • processor topology
    • set of operations to perform
  • Output: map file for the job launcher
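The partition-tree idea can be sketched in plain Python: tiling groups the ranks of a box into equal sub-boxes. This mimics the semantics of Rubik's tile() but is not Rubik itself:

```python
# Mimic Rubik-style tiling: split a box of ranks into equal tiles.
# Illustrative only; the real Rubik API is richer.
from itertools import product

def tile(shape, tile_shape):
    """Group the coordinates of a box `shape` into tiles of `tile_shape`.

    Returns {tile_index: [coordinates, ...]}.
    """
    tiles = {}
    for coord in product(*(range(s) for s in shape)):
        key = tuple(c // t for c, t in zip(coord, tile_shape))
        tiles.setdefault(key, []).append(coord)
    return tiles

# A 6x6x6 network split into 3x3x3 cubes -> eight tiles of 27 processors
cubes = tile((6, 6, 6), (3, 3, 3))
print(len(cubes))             # 8
print(len(cubes[(0, 0, 0)]))  # 27
```

Mapping then pairs each application tile with a network tile, as the example on the next slide does.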

Application example

  A Rubik script that maps the 27-task planes of a 9x3x8 application grid onto 27-processor cubes of a 6x6x6 torus:

    app = box([9, 3, 8])       # create app partition tree of 27-task planes
    app.tile([9, 3, 1])
    network = box([6, 6, 6])   # create network partition tree of 27-processor cubes
    network.tile([3, 3, 3])
    network.map(app)           # map task planes into cubes

  (Figure: the 216-task app and the 216-processor network, each split into eight 27-element tiles, with application ranks mapped into the network.)

Mapping pF3D

  • A laser-plasma interaction code used at the National Ignition Facility (NIF) at LLNL
  • Three communication phases over a 3D virtual topology:
    • Wave propagation and coupling: 2D FFTs within XY planes
    • Light advection: send-recv between consecutive XY planes
    • Hydrodynamic equations: 3D near-neighbor exchange

  Time spent in MPI calls:

                2048 cores            16384 cores
  MPI call      Total %   MPI %       Total %   MPI %
  Send          4.90      28.45       23.10     57.21
  Alltoall      8.10      46.94       7.30      18.07
  Barrier       2.78      16.10       8.13      20.15

Performance benefits

  (Figure: comparison of different mappings — TXYZ, XYZT, tile, tiltX, tiltXY — on 2,048 cores; time in seconds split into Barrier, All-to-all, Send and Receive.)

  (Figure: execution time per iteration for the default and best mappings of pF3D from 2,048 to 65,536 cores; a 60% difference is marked.)

  A. Bhatele et al., Mapping applications with collectives over sub-communicators on torus networks, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '12, IEEE Computer Society, November 2012.

Visualizing network traffic using Boxfish

  (Figure: Boxfish renderings of per-link traffic (2 MB to 76 MB) along the X, Y and Z torus directions for the TXYZ, XYZT, tile, tiltX and tiltXY mappings.)

MODELING & SIMULATION
slide-25
SLIDE 25

LLNL-PRES-654602 Abhinav Bhatele @ Charm++ Workshop

Predicting execution time without executing the code

  • Goal: find which mapping gives the best performance
  • Offline metrics: maximum hops, average bytes,

maximum bytes

  • Use network hardware counters to propose new

metrics

  • Supervised learning algorithms to predict

performance

12

  • N. Jain et al. Predicting application performance using supervised learning on communication features. In Proceedings of the ACM/IEEE

International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13. IEEE Computer Society, November 2013.

Why don't we run all the mappings?

  • Wasted allocation hours
  • Wasted time in the queue
  • All we need to know is: which is the best mapping?

              2012     2013
  Intrepid    4.16M    0.73M
  Mira        0.17M    7.67M
  Total       4.33M    8.40M

  13 million core hours!

Supervised learning: scikit-learn

  • Use simulation and other tools to obtain network counters and other contention parameters
  • Exploit supervised learning algorithms for performance prediction:
    • forests of randomized decision trees

  (Figure: part of a randomized decision tree splitting on two features, e.g. X[1] <= 0.4295 and X[0] <= 0.0082, down to leaves.)

  http://scikit-learn.org
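A minimal sketch of this approach with scikit-learn. The features and timings below are synthetic stand-ins for illustration, not the actual SC '13 training sets: each row holds per-mapping communication features, and a forest of randomized decision trees predicts execution time for held-out mappings.

```python
# Sketch: predict execution time from communication features with a
# forest of randomized decision trees (illustrative data only).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# One row per mapping: [avg bytes per link, max bytes on a link,
# avg buffer length, max FIFO length] -- hypothetical feature set
X = rng.random((200, 4))
# Synthetic "execution time", dominated by the max-bytes feature
y = 5.0 * X[:, 1] + 0.5 * X[:, 0] + 0.1 * rng.random(200)

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X[:150], y[:150])      # train on 150 mappings
pred = model.predict(X[150:])    # predict the remaining 50

print(round(model.score(X[150:], y[150:]), 2))  # R^2 on held-out mappings
```

The same R² score is what the Results slides report when comparing individual metrics and hybrid metrics as features.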

Existing and new metrics

  • Existing metrics:
    • maximum hops (dilation)
    • average bytes per link
    • maximum bytes on a link
  • New metrics:
    • buffer length (on intermediate nodes)
    • FIFO length (packets in injection FIFOs)
    • delay per link (packets in buffers / #received packets)

  (Figure: time per iteration plotted against maximum dilation, average bytes per link, and maximum bytes on a link.)
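These metrics are straightforward to compute from per-link counters. A small illustration (the link names and counter values are hypothetical):

```python
# Compute average bytes, maximum bytes, and delay per link from
# hypothetical per-link counters (illustrative values only).

# bytes sent over each link of a toy network
link_bytes = {"A->B": 8e8, "B->C": 1.6e9, "C->D": 2e9, "D->A": 8e8}
# packets waiting in buffers and packets received, per link
buffered = {"A->B": 10, "B->C": 120, "C->D": 300, "D->A": 5}
received = {"A->B": 1e5, "B->C": 2e5, "C->D": 2.5e5, "D->A": 1e5}

avg_bytes = sum(link_bytes.values()) / len(link_bytes)
max_bytes = max(link_bytes.values())
# delay per link = packets in buffers / number of received packets
delay = {l: buffered[l] / received[l] for l in link_bytes}

print(avg_bytes)                  # 1300000000.0
print(max_bytes)                  # 2000000000.0
print(max(delay, key=delay.get))  # C->D, the most congested link
```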

Message life cycle on Blue Gene/Q

  (Figure: contention points along a message's life cycle — memory contention, injection FIFO contention, link contention, receive buffer contention, reception FIFO contention, and memory contention again at the receiver.)

Results

  • Three communication kernels:
    • five-point 2D stencil
    • 14-point 3D stencil
    • all-to-all over sub-communicators

  (Figure: R² of absolute performance correlation for max dilation, avg bytes and max bytes, and for hybrid metrics H1-H6, across message sizes for Sub A2A, 3D Halo and 2D Halo.)

Performance prediction for communication kernels

  • Better correlation than with existing metrics such as average or maximum bytes
  • Hybrid metric: average bytes + maximum bytes + average buffer length + maximum FIFO length
  • Crazy things:
    • combine all training sets
    • use the 16k training set to predict 64k performance

  (Figure: observed vs. predicted execution times for 30 mappings of Sub A2A, 3D Halo and 2D Halo on Blue Gene/Q, 16,384 cores, sorted by actual execution time.)

Predicting the performance of pF3D

  • Production application:
    • has computation
    • and multiple phases of communication
  • Hybrid metric: average bytes + average buffer length + average delay + sum of hops + maximum FIFO length

  (Figure: observed vs. predicted execution times for 30 mappings of pF3D on Blue Gene/Q, 16,384 cores, sorted by actual execution time.)

JOB PLACEMENT & ROUTING

Performance variability

  (Figure: average messaging rates — total bytes sent on the network divided by the time spent sending the messages — for batch jobs running a laser-plasma interaction code.)

Leads to several problems ...

  • Individual jobs run slower:
    • More time to complete science simulations
    • Increased wait time in job queues
    • Inefficient use of machine time allocation/core-hours
  • Overall lower throughput
  • Increased energy usage/costs

Also affects software development

  • Debugging performance issues
  • Quantifying the effect of various software changes on performance:
    • code changes
    • compiler/software stack changes
  • Requesting time for a batch job
  • Writing allocation proposals

pF3D characterization

  (Figure: time spent in communication and computation in pF3D on Hopper, Intrepid and Mira.)

  (Figure: time spent in MPI calls — Alltoall, Barrier, Send, Recv, Probe — on 512 nodes of Hopper, Intrepid and Mira.)

Sources of variability

  • Operating system noise (OS jitter): OS daemons running on some cores of each node
  • Placement/location of the allocated nodes for the job (allocation shape)
  • Contention for shared resources (inter-job contention): sharing network links with other jobs

4x8x8-shaped pF3D job

  (Figure: pF3D job placements on April 11 and April 16, with a MILC job shown in green; 25% higher messaging rate.)

  (Figure: pF3D job placements on April 11 and April 16b, with the MILC job shown in green; 27.8% higher messaging rate. LSMS is not communication-heavy.)

  https://scalability.llnl.gov/performance-analysis-through-visualization/software.php

Slowest vs. fastest job

  (Figure: placements of the slowest and fastest pF3D jobs, March 15 and April 04; three conflicting jobs, two of them MILC. 2.29x higher messaging rate.)

Effect of MILC on pF3D

  (Figure: histogram of total messaging rates for pF3D runs with and without a neighboring MILC job. With MILC: avg = 58 MB/s, σ = 9.12 MB/s; without MILC: avg = 66 MB/s, σ = 8.69 MB/s.)

Performance tip!

  • Variability is insignificant on IBM Blue Gene systems
  • OS noise and allocation shape have a weak correlation with performance
  • The placement of other jobs around a job can affect its performance significantly

  http://www.hpcwire.com/2013/11/16/sc13-research-highlight-goes-performance-neighborhood/

Modeling job placements and message routing

  • Dragonfly topology: a two-level hierarchical topology
  • Routing choices: static (deterministic) vs. dynamic (adaptive); direct vs. indirect (random jumps)
  • Placement options: random, round-robin, blocked

  (Figure: the dragonfly topology — a dragonfly router with processor and network ports; a group of 96 routers with all-to-all networks within rows and columns at level 1; and a level-2 all-to-all network connecting groups, not all groups or links shown.)
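The three placement options can be sketched as simple policies assigning a job's nodes to dragonfly groups. The group count and job size below are made up, and this is not the simulator used in the study:

```python
# Sketch of blocked, round-robin, and random job placement over
# dragonfly groups (illustrative only).
import random

GROUPS = 8           # number of dragonfly groups (hypothetical)
NODES_PER_GROUP = 4  # nodes per group (hypothetical)

def blocked(job_size):
    """Fill groups one after another."""
    return [n // NODES_PER_GROUP for n in range(job_size)]

def round_robin(job_size):
    """Spread consecutive nodes across groups."""
    return [n % GROUPS for n in range(job_size)]

def randomized(job_size, seed=0):
    """Pick a random group for every node."""
    rng = random.Random(seed)
    return [rng.randrange(GROUPS) for _ in range(job_size)]

print(blocked(8))      # [0, 0, 0, 0, 1, 1, 1, 1]
print(round_robin(8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Blocked placement keeps a job's traffic inside few groups (stressing local links), while round-robin and random placements spread it over the level-2 network, which is why placement and routing have to be studied together.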

Single jobs

  • All-to-all over sub-communicators
  • Various traffic metrics

  (Figure: link usage in MB over all links for a many-to-many pattern, under placements RDN, RDR, RDC, RDG, RRN and RRR combined with static direct, adaptive direct, static indirect, adaptive indirect and adaptive hybrid routing; box plots show minimum, quartiles, median, average and maximum, with the lowest maximum highlighted.)

Parallel job workload

  • Representative of NERSC workloads
  • Static routing is out of the question
  • Routings with indirect jumps are preferred

  (Figure: link usage in MB for Workloads 2 and 4 under adaptive direct, adaptive indirect and adaptive hybrid routing with placements RDN, RDR, RDC, RDG, RRN and RRR.)

  N. Jain et al., Maximizing network throughput on the dragonfly interconnect, in submission to the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, 2014.

Summary

  • Optimizing communication is the #1 priority:
    • Minimize off-node communication
    • Map the remaining off-node communication carefully
  • Job placements and mapping are non-intrusive methods for improving performance
  • Going forward, modeling and simulation will be crucial for:
    • designing future networks
    • predicting application performance

http://computation-rnd.llnl.gov/extreme-computing/interconnection-networks.php

This work was funded by the Laboratory Directed Research and Development Program at LLNL under project tracking code 13-ERD-055: STATE - Scalable Topology Aware Task Embedding.