slide-1
SLIDE 1

Awareness of MPI Virtual Process Topologies on the Single-Chip Cloud Computer

Steffen Christgau, Bettina Schnor

Potsdam University Institute of Computer Science Operating Systems and Distributed Systems

HIPS@IPDPS, May 21, 2012

slide-2
SLIDE 2

Outline

1. The Single-Chip Cloud Computer (SCC)
2. RCKMPI: MPI on the SCC
3. Topology-Awareness for RCKMPI: The Concept
4. Topology-Awareness for RCKMPI: Evaluation
5. Summary and Future Work

Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 2 of 19


slide-7
SLIDE 7
  • 1. The Single-Chip Cloud Computer (SCC)

  • Many-Core Architecture Research Community (MARC), established by Intel: research by universities together with Intel
  • World-wide community (dominated by European institutions)
  • Website and regular symposia (every six months, five up to now)
  • Our group is a MARC member
  • Focus: application scalability; experiences with parallel ASP and climate simulation


slide-8
SLIDE 8

Single-Chip Cloud Computer

[Figure: the SCC chip with four memory controllers (MC 0–MC 3), voltage regulator controller (VRC) and system interface (SIF); each tile holds two cores (Core 0, Core 1), two 256 KB L2 caches, a 16 KB Message Passing Buffer (MPB) and a router.]

24 tiles with 48 P54C cores, connected via a Network-on-Chip, no cache coherence; fast 16 KB SRAM on each tile: the Message Passing Buffer (MPB)


slide-9
SLIDE 9
  • 2. RCKMPI: MPI on the SCC

RCKMPI is a fork of MPICH2.

[Figure: the MPICH2 software stack. The application sits on top of MPICH2, which provides the Message Passing Interface and ROMIO (the MPI-IO implementation) with its ADIO layer. Below the Abstract Device Interface (ADI3), the Channel-3 devices (Cray, BG, Sock, Nemesis) are joined by the SCC-specific devices SCCMPB, SCCSHM and SCCMulti; the Process Management Interface completes the stack.]



slide-11
SLIDE 11

SCCMPB uses the fast Message Passing Buffer of each tile as shared memory and divides it into n equal-size Exclusive Write Sections (EWS)

⇒ remote write, local read

[Figure: the MPB of core/rank 0, divided into one Exclusive Write Section per rank (0, 1, …, n-1, n) of MPI_COMM_WORLD; each remote rank writes only into its own section, which rank 0 then reads locally.]

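The EWS scheme can be modeled in a few lines of plain C (a simulation with hypothetical names, not RCKMPI code): each sender owns one fixed-size section in every receiver's MPB, so a transfer is a remote write by the sender followed by a local read by the receiver, and no two writers ever touch the same bytes.

```c
#include <assert.h>
#include <string.h>

#define NPROCS 4        /* hypothetical number of MPI processes     */
#define MPB_SIZE 8192   /* assumed 8 KB MPB share per core (16 KB per tile) */

/* One simulated MPB per core, split into NPROCS equal-size EWS. */
static unsigned char mpb[NPROCS][MPB_SIZE];

static size_t ews_size(void) { return MPB_SIZE / NPROCS; }

/* Remote write: sender 'src' copies its payload into ITS OWN
 * section of the receiver's MPB -- no other rank writes there. */
void mpb_send(int src, int dst, const void *buf, size_t len)
{
    assert(len <= ews_size());
    memcpy(&mpb[dst][src * ews_size()], buf, len);
}

/* Local read: receiver 'dst' reads the sender's section from its own MPB. */
void mpb_recv(int src, int dst, void *buf, size_t len)
{
    memcpy(buf, &mpb[dst][src * ews_size()], len);
}
```

The point of the exclusive-write split is that it needs no locks: writer and reader of each section are fixed by the layout.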


slide-13
SLIDE 13

Comparison of different CH3-devices at maximum Manhattan distance 8

[Plot: bandwidth (0.001–100 MByte/s, log scale) over message size (4 Byte to 4 MiByte) for the RCKMPI CH3 devices sccmulti, sccmpb and sccshm.]

slide-14
SLIDE 14

Bandwidths for Manhattan distance 0, 5 and 8 (two processes started).

[Plot: bandwidth (10–90 MByte/s) over message size (4 Byte to 4 MiByte) for core pairs 00/01 (distance 0), 00/10 (distance 5) and 00/47 (distance 8).]
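The distances behind these core pairs follow directly from the SCC core numbering: two cores per tile, tiles arranged row by row in a 6 x 4 mesh. A small sketch:

```c
#include <assert.h>
#include <stdlib.h>

/* Manhattan distance between two SCC cores. Cores 0..47 are numbered
 * two per tile; the 24 tiles form a 6-column x 4-row mesh. */
int scc_manhattan_dist(int core_a, int core_b)
{
    int tile_a = core_a / 2, tile_b = core_b / 2;
    int ax = tile_a % 6, ay = tile_a / 6;
    int bx = tile_b % 6, by = tile_b / 6;
    return abs(ax - bx) + abs(ay - by);
}
```

scc_manhattan_dist(0, 47) gives 8, the maximum distance, matching the core pair 00/47 above; cores 00/01 share a tile, so their distance is 0.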

slide-15
SLIDE 15

Bandwidths for maximum Manhattan distance 8, and varied number of MPI processes

[Plot: bandwidth (10–70 MByte/s) over message size (4 Byte to 4 MiByte) for 2, 12, 24 and 48 MPI processes.]

slide-16
SLIDE 16

Remember: SCCMPB uses the fast Message Passing Buffer of each tile as shared memory (total 384 KB)

[Figure: the MPB of core/rank 0 again, one Exclusive Write Section per rank of MPI_COMM_WORLD.]

The MPB is equally divided into n sections ⇒ the section size depends on the number of started MPI processes.

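The shrinking sections can be quantified with a hedged sketch (it assumes an 8 KB MPB share per core and rounding down to the P54C's 32-byte cache lines; RCKMPI's exact layout may differ):

```c
#include <assert.h>
#include <stddef.h>

#define MPB_SHARE 8192   /* assumed 8 KB MPB share per core */
#define CACHE_LINE 32    /* P54C cache-line size in bytes   */

/* Per-rank Exclusive Write Section size for n MPI processes,
 * rounded down to whole cache lines. */
size_t ews_bytes(int nprocs)
{
    return (MPB_SHARE / nprocs) / CACHE_LINE * CACHE_LINE;
}
```

Under these assumptions, with 48 processes each EWS shrinks to 160 bytes, consistent with the bandwidth loss for many processes seen in the previous plot.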

slide-17
SLIDE 17
  • 3. Topology Awareness for RCKMPI: The Concept

The bandwidth between two RCKMPI processes depends on the number of started MPI processes, since a fully connected network between all MPI processes is maintained.


slide-18
SLIDE 18

Application Behaviour

[Figure: Task Interaction Graph of processes 1 … n, with tasks T_u, T_v and weighted edges between them.]

Goal: the bandwidth between communicating processes, the so-called neighbors in the Task Interaction Graph, should be increased.



slide-20
SLIDE 20

Requirements:

1. An improved MPB layout must consider both communication neighbors and group communication (barriers, broadcasts, gather/scatter, …).

2. Each MPI process has to know its new offset within all remote MPBs.


slide-21
SLIDE 21

Putting things together ...

[Figure: three layouts of an MPB write section, each consisting of channel headers and per-channel message payload for proc. 1 … proc. n: the original layout with equal channels for all processes; the new layout without topology information; and the new layout with topology information, in which only the actual communication partners (e.g. proc. 2) receive channels, leaving more payload space per channel.]

Internal barrier for the recalculation phase of the new MPB addresses.

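The recalculation can be sketched in plain C (hypothetical names, not RCKMPI's code): since every rank runs the same deterministic computation over the shared topology information, each process can derive its own write offset in every remote MPB without exchanging extra messages. The sketch assumes a fixed channel size of two cache lines per neighbor:

```c
#include <assert.h>

#define CACHE_LINE 32
#define CHANNEL_BYTES (2 * CACHE_LINE)  /* assumed fixed channel size */

/* Offset of the channel that 'writer' uses inside the MPB of the
 * section's owner, given the owner's neighbor list (sorted, identical
 * on every rank). Returns -1 if 'writer' is not a neighbor. */
long channel_offset(const int neighbors[], int n_neighbors, int writer)
{
    for (int i = 0; i < n_neighbors; i++)
        if (neighbors[i] == writer)
            return (long)i * CHANNEL_BYTES;
    return -1;
}
```

Because the neighbor lists are derived from the same topology on all ranks, the internal barrier only has to separate the old layout from the new one; no offsets need to be communicated.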

slide-22
SLIDE 22

MPI offers an API to specify a virtual process topology:

#define NUM_DIMS 2

int grid_dims[NUM_DIMS];
int grid_periods[NUM_DIMS];
MPI_Comm comm_topo;

/* for a grid, set all items of grid_periods to 0 */
for (int i = 0; i < NUM_DIMS; i++)
    grid_periods[i] = 0;

MPI_Dims_create(numProcs, NUM_DIMS, grid_dims);
MPI_Cart_create(MPI_COMM_WORLD, NUM_DIMS, grid_dims,
                grid_periods, true, &comm_topo);

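MPI_Dims_create fills grid_dims with a balanced factorization of the process count. A minimal pure-C illustration of the idea for the 2-D case (a conceptual stand-in, not MPI's actual algorithm, which also honors pre-set dimensions and higher dimension counts):

```c
#include <assert.h>

/* Most balanced factorization n = dims[0] * dims[1], dims[0] >= dims[1]:
 * what MPI_Dims_create(n, 2, dims) conceptually computes for a 2-D grid. */
void balanced_dims2(int n, int dims[2])
{
    int d = 1;
    for (int i = 1; i * i <= n; i++)   /* largest divisor <= sqrt(n) */
        if (n % i == 0)
            d = i;
    dims[1] = d;
    dims[0] = n / d;
}
```

For the SCC's 48 cores this yields an 8 x 6 grid.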


slide-24
SLIDE 24
  • 4. Topology Awareness for RCKMPI: Evaluation

[Plot: bandwidth (5–30 MByte/s) over message size (4 Byte to 4 MiByte) for enhanced RCKMPI with a 1-D topology (48 processes, 2 and 3 cache lines per channel) versus enhanced RCKMPI without topology information (48 processes).]


slide-25
SLIDE 25

Results for 2D CFD application with ring topology:

[Figure: ring of processes 1 … n.]

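In a ring topology each process communicates with exactly two neighbors, so the Task Interaction Graph is sparse; the neighbor ranks are simple modular arithmetic (minimal sketch):

```c
#include <assert.h>

/* Left and right neighbors of 'rank' in a ring of 'n' processes. */
int ring_left(int rank, int n)  { return (rank - 1 + n) % n; }
int ring_right(int rank, int n) { return (rank + 1) % n; }
```

With only two neighbors per rank, the topology-aware layout can give each channel far more MPB payload space than the 1/n split of the original layout.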

slide-26
SLIDE 26

[Plot: speedup (up to 30) over the number of processes (4–48) for enhanced RCKMPI with topology information (2 cache lines) versus the original RCKMPI.]


slide-27
SLIDE 27

Summary and Future work

The SCC is equipped with a fast NoC; the Message Passing Buffer on each tile is beneficial for fast communication. MPI applications gain bandwidth by using MPI's virtual process topologies and rearranging the MPB layout.

Current/Future Work:

  • Comparison with I. C. Ureña and M. Gerndt: Improved RCKMPI's SCCMPB Channel: Scaling and Dynamic Processes Support, ARCS 2012
  • Fixed the one-sided communication in RCKMPI ⇒ support of applications based on Global Arrays

