SLIDE 1

LogP Predictions Implementation Conclusions Motivation

A practical Approach to the Rating of Barrier Algorithms using the LogP Model and Open MPI

Torsten Höfler, Wolfgang Rehm TU Chemnitz, Germany 24.05.2005

Torsten Höfler, Wolfgang Rehm TU Chemnitz, Germany Barrier Rating

SLIDE 2

Outline

  • Motivation
  • LogP Predictions
  • Implementation
  • Conclusions

SLIDE 4

Motivation

  • optimal solution for the barrier problem
  • barrier time complexity studies
  • exhaustive comparison of different algorithms
  • framework for general comparison studies
  • Open MPI is easily extensible
  • Question: is LogP accurate enough?

SLIDE 5

Problems

  • unlimited number of architectures
  • generic optimal solution = holy grail?
  • definition of several constraints for a given architecture
  • Fast Ethernet, Extreme Black Diamond Switch, 512 nodes
  • new architectures have to be added by hand
  • several models available -> LogP should be accurate enough

SLIDE 6

Principles

  • one architecture as example
  • easy testing of new architectures
  • framework to implement and test new algorithms

SLIDE 7

Architectural Assumptions

  • full bisectional bandwidth
  • full duplex operation
  • unlimited switch forwarding rate
  • constant latency
  • overhead bigger than gap
  • overhead is constant ($o_s = o_r$)

SLIDE 8

Base Equations

Several basic equations and variables:

$f_r = \max\{o_r, g\}$
$f_s = \max\{o_s, g\}$
$t_r = \max\{f_r, o_s + L + o_r\} = \max\{\max\{g, o_r\}, o_s + L + o_r\} = \max\{g, o_s + L + o_r\}$

Simplifying assumptions:

$f_r = f_s = o$
$t_r = t_s = 2o + L$
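The base equations can be checked numerically with a minimal sketch like the one below. The class name `LogP`, the function names, and the parameter values are assumptions made for illustration; the slides give no concrete numbers at this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """LogP parameters; all times share one unit (e.g. microseconds)."""
    L: float    # network latency
    o_s: float  # send overhead
    o_r: float  # receive overhead
    g: float    # gap between consecutive messages


def f_s(p: LogP) -> float:
    """f_s = max{o_s, g}."""
    return max(p.o_s, p.g)


def f_r(p: LogP) -> float:
    """f_r = max{o_r, g}."""
    return max(p.o_r, p.g)


def t_r(p: LogP) -> float:
    """t_r = max{f_r, o_s + L + o_r}."""
    return max(f_r(p), p.o_s + p.L + p.o_r)


def t_r_reduced(p: LogP) -> float:
    """Reduced form max{g, o_s + L + o_r}; equal to t_r for non-negative parameters."""
    return max(p.g, p.o_s + p.L + p.o_r)


# Hypothetical values chosen so that o_s = o_r = o and o > g.
params = LogP(L=50.0, o_s=10.0, o_r=10.0, g=4.0)
print(f_s(params), f_r(params), t_r(params))
```

With these hypothetical values, $f_r = f_s = o$ and $t_r = 2o + L$, which matches the simplifying assumptions. The inner $\max\{g, o_r\}$ can be dropped because $o_r \le o_s + L + o_r$ whenever $o_s$ and $L$ are non-negative.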

SLIDE 9

Model Predictions

algorithms are divided into different complexity classes:

  • $O(P)$ ⇒ Central Counter
  • $O(n \cdot \log_n P)$ ⇒ Combining Tree, f-way Tournament, MCS
  • $O(\log_2 P)$ + Bcast ⇒ Tournament, BST
  • $O(\log_2 P)$ ⇒ Butterfly, Pairwise Exchange, Dissemination

  • $O(\log_2 P)$ is an optimal solution within the LogP model
  • the proof is trivial
  • Assumption: the Dissemination Barrier should perform best
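To get a feel for how far apart these classes are, the sketch below evaluates only the growth terms for a few process counts. It ignores all constant factors and the extra broadcast of the Tournament/BST class, so the numbers are not runtimes. The function names are hypothetical, and n = 4 is taken from the "Combining Tree (n=4)" label in the benchmark legend.

```python
def ceil_log(p: int, base: int) -> int:
    """Smallest k with base**k >= p, using integer arithmetic only."""
    k, reach = 0, 1
    while reach < p:
        reach *= base
        k += 1
    return k


def growth_terms(p: int, n: int = 4) -> dict:
    """Growth term of each complexity class for p processes (constants ignored)."""
    return {
        "O(P)": p,
        "O(n * log_n P)": n * ceil_log(p, n),
        "O(log_2 P)": ceil_log(p, 2),
    }


for p in (16, 64, 256):
    print(p, growth_terms(p))
```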

SLIDE 10

Example - Dissemination Barrier

[Figure: dissemination barrier example with five processes (1 to 5), drawn in three steps: Step 1 (stage 0), Step 2 (stage 1), Step 3 (stage 2)]
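The three steps for five processes can be reproduced with a small simulation. This sketch assumes the usual formulation of the dissemination barrier, in which process i signals process (i + 2^k) mod P in stage k; the slides do not spell the pattern out, and the code numbers processes from 0 rather than from 1. All function names are hypothetical.

```python
def dissemination_rounds(p: int) -> int:
    """Number of stages: the smallest k with 2**k >= p."""
    k, reach = 0, 1
    while reach < p:
        reach *= 2
        k += 1
    return k


def simulate(p: int) -> list:
    """Return, per process, the set of processes it has heard from,
    directly or transitively, after all stages."""
    known = [{i} for i in range(p)]
    for k in range(dissemination_rounds(p)):
        dist = 2 ** k
        # All messages of a stage are sent before any is received,
        # so the new state is built from a snapshot of the old one.
        snapshot = [set(s) for s in known]
        for i in range(p):
            peer = (i + dist) % p
            known[peer] |= snapshot[i]
    return known


known = simulate(5)
print(dissemination_rounds(5), all(len(s) == 5 for s in known))
```

A process may leave the barrier once it has heard from every other process, which the simulation confirms after $\lceil \log_2 P \rceil$ stages.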

SLIDE 11

Example - Dissemination Barrier

[Figure: LogP timing diagram for processes P0 to P5; each process goes through three rounds of send overhead (o_s) followed by receive overhead (o_r)]

SLIDE 12

Example - Dissemination Barrier

[Figure: same LogP timing diagram for P0 to P5 as on slide 11]

$rt = \max\{t_r, t_s\} \cdot \lceil \log_2 P \rceil$, with $t_r = \max\{g, o_s + L + o_r\}$

SLIDE 13

Example - Dissemination Barrier

[Figure: same LogP timing diagram for P0 to P5 as on slide 11]

assume: $o > g$

$rt = (2o + L) \cdot \lceil \log_2 P \rceil$
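As a worked example of the simplified prediction, the sketch below evaluates $rt = (2o + L) \cdot \lceil \log_2 P \rceil$ for a few process counts. The function names and the values o = 10, L = 50, g = 4 are hypothetical and not measurements from the talk.

```python
def ceil_log2(p: int) -> int:
    """Smallest k with 2**k >= p, using integer arithmetic only."""
    k, reach = 0, 1
    while reach < p:
        reach *= 2
        k += 1
    return k


def predicted_rt(p: int, o: float, L: float, g: float) -> float:
    """Predicted dissemination barrier runtime under the assumption o > g."""
    if o <= g:
        raise ValueError("the simplified formula assumes o > g")
    return (2 * o + L) * ceil_log2(p)


# Hypothetical parameters in microseconds.
for p in (5, 6, 64, 65):
    print(p, predicted_rt(p, o=10.0, L=50.0, g=4.0))
```

Because of the ceiling, the prediction is a step function: it stays flat between powers of two and jumps by one round time $(2o + L)$ whenever P crosses a power of two.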

SLIDE 14

Benchmark Results

[Figure: Dissemination Barrier. x axis: # processors (P); y axis: runtime in microseconds (rt); plotted series: Dissemination, rt(P)]

SLIDE 15

Benchmark Results

[Figure: Tournament Barrier. x axis: # processors (P); y axis: runtime in microseconds (rt); plotted series: Tournament Barrier, rt(P)]

SLIDE 16

Benchmark Results

[Figure: Algorithm Comparison. x axis: # processors (P); y axis: runtime in microseconds (rt); plotted series: Central Counter, Combining Tree (n=4), Tournament Barrier, Dissemination, Open MPI]

SLIDE 17

Benchmark Results

Algorithm          128 nodes     256 nodes
Central Counter    4594.50 µs    4909.67 µs
Combining Tree     4009.79 µs    4343.63 µs
Tournament         3642.54 µs    4378.77 µs
Dissemination      1904.57 µs    1977.12 µs
Open MPI           3559.88 µs    4226.88 µs
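The table is easier to read as ratios against the Open MPI row. The sketch below uses only the values from the table; the names `RESULTS_US`, `speedup_over`, and `fastest` are hypothetical.

```python
# Runtimes in microseconds, copied from the table: (128 nodes, 256 nodes).
RESULTS_US = {
    "Central Counter": (4594.50, 4909.67),
    "Combining Tree": (4009.79, 4343.63),
    "Tournament": (3642.54, 4378.77),
    "Dissemination": (1904.57, 1977.12),
    "Open MPI": (3559.88, 4226.88),
}


def speedup_over(baseline: str, results: dict) -> dict:
    """Ratio baseline_time / algorithm_time per column; above 1 means faster."""
    base = results[baseline]
    return {
        name: tuple(b / t for b, t in zip(base, times))
        for name, times in results.items()
    }


def fastest(results: dict, column: int) -> str:
    """Name of the algorithm with the lowest runtime in the given column."""
    return min(results, key=lambda name: results[name][column])


for name, (s128, s256) in speedup_over("Open MPI", RESULTS_US).items():
    print(f"{name:16s} {s128:5.2f}x {s256:5.2f}x")
print(fastest(RESULTS_US, 0), fastest(RESULTS_US, 1))
```

Dissemination is the fastest entry in both columns, at a little under twice the speed of the Open MPI row on 128 nodes and a little over twice on 256 nodes.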

SLIDE 19

Open MPI

also useable for production environments ⇒ Open MPI as MPI framework

[Figure: Open MPI architecture. Blocks shown: User Application; MPI API; Run Time Environment (RTE); Modular Component Architecture (MCA); PtP Mgmt. Layer (PML); COLL; TOPO; PTL (three times)]

SLIDE 20

Component Implementation

  • initialization returns user-defined priority
  • algorithm selection:
      0: automatic benchmark
      1: Central Counter
      2: Combining Tree
      3: Tournament
      4: Dissemination
      5: Binomial Tree
      6: n-way Dissemination
  • Checkpoint/Restart is handled by lower layers
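The selection values listed above suggest a simple dispatch: a fixed number picks one algorithm, and 0 picks whichever algorithm benchmarks fastest. The sketch below only illustrates that idea. It is not the Open MPI component interface; `select_algorithm`, the `measure` callback, and the timings are all hypothetical.

```python
from typing import Callable, Dict

# Selection values as listed on the slide (0 is handled separately).
ALGORITHMS: Dict[int, str] = {
    1: "Central Counter",
    2: "Combining Tree",
    3: "Tournament",
    4: "Dissemination",
    5: "Binomial Tree",
    6: "n-way Dissemination",
}


def select_algorithm(choice: int, measure: Callable[[str], float]) -> str:
    """Return the algorithm name for a selection value.

    choice == 0 runs the supplied benchmark callback on every algorithm
    and returns the one with the lowest measured time.
    """
    if choice == 0:
        return min(ALGORITHMS.values(), key=measure)
    if choice not in ALGORITHMS:
        raise ValueError(f"unknown selection value: {choice}")
    return ALGORITHMS[choice]


# Made-up timings standing in for a real benchmark run.
fake_timings = {
    "Central Counter": 9.0,
    "Combining Tree": 7.0,
    "Tournament": 6.0,
    "Dissemination": 3.0,
    "Binomial Tree": 5.0,
    "n-way Dissemination": 4.0,
}

print(select_algorithm(3, fake_timings.__getitem__))
print(select_algorithm(0, fake_timings.__getitem__))
```

Passing the benchmark as a callback keeps the selection logic independent of how timings are obtained, so the same dispatch works with real measurements or with canned values in a test.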

SLIDE 21

Conclusions

  • the assumptions made are valid
  • the LogP model is accurate
  • Dissemination is optimal for the given scenario
  • different networks exhibit different behavior
  • derivation of new algorithms for different hardware (e.g. offloading-based HW) could require detailed models

⇒ a general methodology for developing optimal barrier algorithms has been shown

SLIDE 22

Future Work

  • new model for small messages for offloading-based NICs (LoP)
  • new barrier algorithms to support hardware parallelism
  • simplification of the LoP model (non-linear, >6 parameters)
