Awareness of MPI Virtual Process Topologies on the Single-Chip Cloud Computer
Steffen Christgau, Bettina Schnor
Potsdam University, Institute of Computer Science, Operating Systems and Distributed Systems
HIPS@IPDPS, 21 May 2012
Outline
The Single-Chip Cloud Computer (SCC)
RCKMPI – MPI on the SCC
Topology-Awareness for RCKMPI – The Concept
Topology-Awareness for RCKMPI – Evaluation
Summary and Future Work
- 1. The Single-Chip Cloud Computer (SCC)
Many-Core Architecture Research Community (MARC): established by Intel; research of universities together with Intel
World-wide community (dominated by European institutions); website and regular symposia (every 6 months, 5 up to now)
Our group is a MARC member
Focus: application scalability; experiences with parallel ASP and climate simulation
Single-Chip Cloud Computer
[Diagram: SCC chip with memory controllers MC 0–3, the VRC and the SIF attached to the mesh; each tile contains two cores (Core0, Core1) with 256 KB L2 cache each, a 16 KB MPB and a router.]
24 tiles with 48 P54C cores, connected via a Network-on-Chip, no cache coherence; fast 16 KB SRAM on each tile: the Message Passing Buffer (MPB).
- 2. RCKMPI: MPI on the SCC
RCKMPI is a fork of MPICH2.
[Diagram: the MPICH2 software stack — the application on top of the Message Passing Interface, ROMIO (the MPI-IO implementation) with ADIO, the Abstract Device Interface (ADI3) with the Cray, BG and Channel-3 devices, the Process Management Interface, and the CH3 channels Sock and Nemesis extended by the SCC-specific channels SCCMPB, SCCSHM and SCCMulti.]
SCCMPB uses the fast Message Passing Buffer of each tile as shared memory and divides it into n equal-size Exclusive Write Sections (EWS) ⇒ remote write, local read.
[Diagram: the MPB of core/rank 0 holds one write section for each of the ranks 1 to n of MPI_COMM_WORLD.]
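A minimal sketch of this remote-write/local-read scheme, assuming each core contributes 8 KB of the 16 KB tile MPB; mpb_base, mpb_put and mpb_get are hypothetical names, not the actual RCKMPI code:

#include <stdint.h>
#include <string.h>

#define NUM_CORES 48
#define MPB_BYTES_PER_CORE 8192              /* assumption: half of the 16 KB tile MPB */

/* mpb_base[c] is assumed to hold the locally mapped address of core c's MPB share */
extern volatile uint8_t *mpb_base[NUM_CORES];

static size_t ews_size(int nprocs)             { return MPB_BYTES_PER_CORE / (size_t)nprocs; }
static size_t ews_offset(int rank, int nprocs) { return (size_t)rank * ews_size(nprocs); }

/* sender: rank 'me' writes remotely into its own EWS inside core 'dest''s MPB */
static void mpb_put(int me, int dest, int nprocs, const void *msg, size_t len)
{
    memcpy((void *)(mpb_base[dest] + ews_offset(me, nprocs)), msg, len);
}

/* receiver: core 'me' reads locally what rank 'src' wrote into 'me''s MPB */
static void mpb_get(int me, int src, int nprocs, void *buf, size_t len)
{
    memcpy(buf, (const void *)(mpb_base[me] + ews_offset(src, nprocs)), len);
}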
Comparison of different CH3-devices at maximum Manhattan distance 8
[Plot: bandwidth (MByte/s, log scale) vs. message size (4 Byte to 4 MiByte) for the RCKMPI sccmulti, sccmpb and sccshm CH3 devices.]
Bandwidths for Manhattan distance 0, 5 and 8 (two processes started).
[Plot: bandwidth (MByte/s) vs. message size for the core pairs 00/01, 00/10 and 00/47.]
Bandwidths for maximum Manhattan distance 8 and a varied number of MPI processes.
[Plot: bandwidth (MByte/s) vs. message size for 2, 12, 24 and 48 MPI processes.]
Remember: SCCMPB uses the fast Message Passing Buffer of each tile as shared memory (384 KB in total).
[Diagram: the MPB of core/rank 0 holds one write section for each of the ranks 1 to n of MPI_COMM_WORLD.]
The MPB is equally divided into n sections ⇒ the section size depends on the number of started MPI processes.
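For illustration, assuming each core's share of the tile MPB is 8 KB: with 2 started processes every EWS is 8192 B / 2 = 4096 B, whereas with 48 processes it shrinks to 8192 B / 48 ≈ 170 B, i.e. only a few 32-byte cache lines per communication partner. This is consistent with the bandwidth drop for growing process counts in the previous plot.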
- 3. Topology Awareness for RCKMPI: The Concept
The bandwidth between two RCKMPI processes depends on the number of started MPI processes, since a fully connected network between all MPI processes is maintained.
Application Behaviour
[Diagram: Task Interaction Graph — processes 1 to n with tasks T_u, T_v and communication edges.]
Goal: increase the bandwidth between communicating processes, the so-called neighbors in the Task Interaction Graph.
Requirements:
1. An improved MPB layout must consider both communication neighbors and group communication (barriers, broadcasts, gather/scatter, ...).
2. Each MPI process has to know its new offset within all remote MPBs.
Putting things together ...
[Diagram: MPB write-section layouts. Original layout: one write section per process p1 ... pn, each consisting of channel headers and a payload area (channel header plus message payload). New layout without topology information: the same per-process division. New layout with topology information: write sections assigned according to the communication neighbors from the virtual topology.]
An internal barrier is needed for the recalculation phase of the new MPB addresses.
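A minimal sketch of how such topology-aware offsets could be computed (illustrative only, not the RCKMPI implementation; is_neighbor and the constants are assumptions, and space for group communication is omitted). Since every rank runs the same deterministic calculation on the same topology information, all ranks agree on the new offsets after the internal barrier:

#include <stddef.h>

#define MAX_PROCS 48
#define MPB_BYTES_PER_CORE 8192                 /* assumption, see the earlier sketch */

/* is_neighbor[owner][writer] is assumed to be derived from the virtual topology
   (e.g. via MPI_Cart_shift) and to be identical on every rank */
static int is_neighbor[MAX_PROCS][MAX_PROCS];

/* offset of 'writer''s write section inside 'owner''s MPB under the
   topology-aware layout, or -1 if 'writer' gets no dedicated section */
static ptrdiff_t topo_offset(int owner, int writer, int nprocs)
{
    int sections = 0, index = -1;

    for (int r = 0; r < nprocs; r++) {
        if (r == owner || !is_neighbor[owner][r])
            continue;                           /* only topology neighbors get a section */
        if (r == writer)
            index = sections;
        sections++;
    }
    if (index < 0 || sections == 0)
        return -1;
    /* fewer sections than processes, so the MPB is split among neighbors only */
    return (ptrdiff_t)index * (ptrdiff_t)(MPB_BYTES_PER_CORE / sections);
}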
MPI offers an API to specify a virtual process topology:

#define NUM_DIMS 2

int grid_dims[NUM_DIMS];
int grid_periods[NUM_DIMS];
MPI_Comm comm_topo;

/* for a grid, set all items of grid_periods to 0; grid_dims must also be
   zeroed so that MPI_Dims_create is free to choose both dimensions */
for (int i = 0; i < NUM_DIMS; i++) {
    grid_dims[i] = 0;
    grid_periods[i] = 0;
}

MPI_Dims_create(numProcs, NUM_DIMS, grid_dims);
MPI_Cart_create(MPI_COMM_WORLD, NUM_DIMS, grid_dims,
                grid_periods, 1 /* allow reordering */, &comm_topo);
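Continuing the example, each process can then query its grid coordinates and its direct neighbors with standard MPI calls; these neighbor ranks are exactly the processes a topology-aware MPB layout favors:

int my_rank, coords[NUM_DIMS];
int left, right, up, down;

MPI_Comm_rank(comm_topo, &my_rank);
MPI_Cart_coords(comm_topo, my_rank, NUM_DIMS, coords);

/* neighbor ranks along both grid dimensions; MPI_PROC_NULL at the borders
   of the non-periodic grid */
MPI_Cart_shift(comm_topo, 0, 1, &left, &right);
MPI_Cart_shift(comm_topo, 1, 1, &up, &down);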
- 4. Topology Awareness for RCKMPI: Evaluation
[Plot: bandwidth (MByte/s) vs. message size (4 Byte to 4 MiByte) for enhanced RCKMPI with a 1D topology (48 processes, 2 cache lines), enhanced RCKMPI with a 1D topology (48 processes, 3 cache lines), and enhanced RCKMPI without topology information (48 processes).]
Results for 2D CFD application with ring topology:
[Diagram: ring of processes 1, 2, ..., n.]
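A sketch of how such a ring topology could be declared and used by the application (illustrative; ROW_LEN, the halo buffers and numProcs are assumptions): a periodic 1D Cartesian topology is a ring, and each process exchanges its boundary rows only with its two ring neighbors.

#include <mpi.h>

#define ROW_LEN 1024                             /* assumed length of one boundary row */

static void ring_halo_exchange(int numProcs,
                               double *send_prev, double *send_next,
                               double *recv_prev, double *recv_next)
{
    int dims[1] = {0}, periods[1] = {1};         /* periodic 1D grid = ring */
    int prev, next;
    MPI_Comm ring_comm;

    /* in a real solver the ring communicator would be created once at startup */
    MPI_Dims_create(numProcs, 1, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &ring_comm);
    MPI_Cart_shift(ring_comm, 0, 1, &prev, &next);

    /* exchange boundary rows with both ring neighbors */
    MPI_Sendrecv(send_next, ROW_LEN, MPI_DOUBLE, next, 0,
                 recv_prev, ROW_LEN, MPI_DOUBLE, prev, 0,
                 ring_comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_prev, ROW_LEN, MPI_DOUBLE, prev, 1,
                 recv_next, ROW_LEN, MPI_DOUBLE, next, 1,
                 ring_comm, MPI_STATUS_IGNORE);

    MPI_Comm_free(&ring_comm);
}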
[Plot: speedup vs. number of processes (4 to 48) for enhanced RCKMPI with topology information (2 cache lines) and for the original RCKMPI.]
Summary and Future Work
The SCC is equipped with a fast NoC; the per-tile Message Passing Buffer is beneficial for fast communication.
Bandwidth gains for MPI applications are obtained by using MPI's virtual process topologies and rearranging the MPB layout.
Current/Future Work:
Comparison with I. C. Ureña and M. Gerndt: Improved RCKMPI's SCCMPB Channel: Scaling and Dynamic Processes Support, ARCS 2012.
Fixed the one-sided communication in RCKMPI ⇒ support of applications based on Global Arrays.