

SLIDE 1

Scaling Alltoall Collective on Multi-core Systems

Rahul Kumar, Amith R Mamidala, Dhabaleswar K Panda

Department of Computer Science & Engineering The Ohio State University

{kumarra, mamidala, panda}@cse.ohio-state.edu

SLIDE 2

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 3

Introduction

  • Multi-core architectures are widely used for high performance computing
    – The Ranger cluster at TACC has 16 cores/node and more than 60,000 cores in total
  • Message Passing is the default programming model for distributed memory systems
  • MPI provides many communication primitives
  • MPI collective operations are widely used in applications

SLIDE 4

Introduction

  • MPI_Alltoall is the most intensive collective and is widely used in many applications such as CPMD, NAMD, FFT, and matrix transpose.
  • In MPI_Alltoall, every process has different data to send to every other process (a minimal sketch appears below).
  • An efficient alltoall is highly desirable for multi-core systems, as the number of processes has increased dramatically due to the low cost per core of multi-core architectures.
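
For reference, a minimal sketch of how an application calls MPI_Alltoall is shown below; the buffer sizes and contents are illustrative and not taken from the talk.

/* Minimal MPI_Alltoall sketch: every rank sends a distinct block of
 * COUNT ints to every other rank and receives one block from each.
 * Build with an MPI C compiler (e.g. mpicc) and launch with mpirun. */
#include <mpi.h>
#include <stdlib.h>

#define COUNT 4                     /* ints per destination (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc((size_t)size * COUNT * sizeof(int));
    int *recvbuf = malloc((size_t)size * COUNT * sizeof(int));
    for (int i = 0; i < size * COUNT; i++)
        sendbuf[i] = rank;          /* every element carries the sender's rank */

    /* Block p of sendbuf goes to rank p; block p of recvbuf comes from rank p. */
    MPI_Alltoall(sendbuf, COUNT, MPI_INT,
                 recvbuf, COUNT, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}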

SLIDE 5

Introduction

  • 24% of the Top 500 supercomputers use InfiniBand as their interconnect (based on Nov '07 rankings).
  • Several different implementations of InfiniBand network interfaces exist:
    – Offload implementation, e.g. InfiniHost III (3rd generation cards from Mellanox)
    – Onload implementation, e.g. QLogic InfiniPath
    – Combination of both onload and offload, e.g. ConnectX from Mellanox

SLIDE 6

Offload & Onload Architecture

[Diagram: offload vs. onload architectures; cores in each node connect through a NIC to the InfiniBand fabric]

  • In an offload architecture, network processing is offloaded to the network interface. The NIC can send messages on its own, relieving the CPU of communication work.
  • In an onload architecture, the CPU performs the communication in addition to the computation. The faster CPU can speed up communication, but communication cannot be overlapped with computation.

SLIDE 7

Characteristics of various Network Interfaces

  • Some basic experiments were performed on various network architectures, and the following observations were made
  • The bi-directional bandwidth of onload network interfaces increases as more cores are used to push data onto the network
  • This is shown in the following slides (a sketch of such a bandwidth microbenchmark appears below)
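
As a rough illustration of this kind of measurement, the sketch below pairs each core on one node with a core on another node and has every pair exchange messages in both directions at once. The rank placement, message size, and iteration count are our assumptions, not the exact benchmark used for these slides.

/* Multi-pair bi-directional bandwidth sketch.  Assumes ranks
 * 0..P/2-1 are placed on node A and P/2..P-1 on node B, so each
 * pair (i, i+P/2) crosses the network.  More pairs means more
 * cores pushing data through the NIC simultaneously. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE (1 << 20)          /* 1 MB per message (illustrative) */
#define ITERS    100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;

    char *sbuf = malloc(MSG_SIZE), *rbuf = malloc(MSG_SIZE);
    memset(sbuf, 1, MSG_SIZE);
    MPI_Request req[2];

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        /* Post the receive and the send together so traffic flows both ways. */
        MPI_Irecv(rbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }
    double elapsed = MPI_Wtime() - start;

    /* Each pair moves 2 * MSG_SIZE bytes per iteration. */
    double gbytes = 2.0 * MSG_SIZE * ITERS / 1e9;
    if (rank == 0)
        printf("per-pair bi-directional bandwidth: %.2f GB/s\n", gbytes / elapsed);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

Varying how many pairs per node are active is what exposes the onload/offload differences reported on the following slides.
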
SLIDE 8

Bi-directional Bandwidth: InfiniPath (onload)

  • Bi-directional bandwidth increases with more cores used to push data
  • With an onload interface, more cores help achieve better network utilization
SLIDE 9

Bi-directional Bandwidth: ConnectX

  • A similar trend is also observed for ConnectX network interfaces
SLIDE 10

Bi-directional Bandwidth: InfiniHost III (offload)

  • However, with offload network interfaces the bandwidth drops when more cores are used
  • We believe this is due to congestion at the network interface when many cores are used simultaneously

SLIDE 11

Results from the Experiments

  • Network interface characteristics differ depending on the implementation
    – QLogic onload implementations: using more cores simultaneously for inter-node communication is beneficial
    – Mellanox offload implementations: using fewer cores at the same time for inter-node communication is beneficial
    – Mellanox ConnectX architecture: using more cores simultaneously is beneficial

SLIDE 12

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 13
Motivation

  • To evaluate the performance of the existing alltoall algorithm, we conduct the following experiment
  • In the experiment, the alltoall time is measured on a set of nodes
  • The number of cores per node participating in the alltoall is increased gradually

SLIDE 14

Motivation

  • The alltoall time doubles when the number of cores per node is doubled
SLIDE 15

What is the problem with the Algorithm?

  • With one core per node, alltoall between two nodes involves one inter-node communication step per core

[Diagram: two nodes exchanging data over the network, first with one core per node and then with two cores per node]

  • With two cores per node, the number of inter-node communications performed by each core increases to two
  • So, on doubling the cores, the alltoall time is almost doubled
  • This is exactly what we obtained in the previous experiment (a simple message count below makes this explicit)
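
As a back-of-the-envelope count (our notation, assuming a direct pairwise alltoall with N nodes and C cores per node): each core must send to the C(N-1) cores on the other nodes, so it issues C(N-1) inter-node messages. Doubling C doubles every core's inter-node traffic, which matches the measured doubling of the alltoall time; for two nodes, going from one to two cores per node raises the count from 1 to 2, as stated above.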

SLIDE 16

Problem Statement

  • Can low-cost shared memory help to avoid network transactions?
  • Can the performance of alltoall be improved, especially for multi-core systems?
  • Which algorithms should be chosen for different InfiniBand implementations?

SLIDE 17

Related Work

  • There have been studies that propose a leader-based hierarchical scheme for other collectives
    – A leader is chosen on each node
    – Only the leader is involved in inter-node communication
    – The communication takes place in three stages:
  • The cores aggregate data at the leader of the node
  • The leader performs inter-node communication
  • The leader distributes the data to the cores
  • We implemented the above scheme for alltoall, as illustrated in the diagram on the next slide

SLIDE 18

Leader-based Scheme for Alltoall

[Diagram: Node 0 and Node 1, each with multiple cores and a leader, showing Step 1, Step 2, and Step 3 of the leader-based scheme]

  • Step 1: all cores send data to the leader
  • Step 2: the leader performs an alltoall with the other leaders
  • Step 3: the leader distributes the respective data to the other cores (a communicator-based sketch of these three steps follows below)
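
Below is a minimal communicator-based sketch of these three steps. It is our illustration, not the authors' implementation: the node-by-node rank layout, the one-int-per-destination payload, the use of MPI_Comm_split_type / MPI_Comm_split, and the packing loops are all assumptions made for brevity.

/* Sketch of a leader-based alltoall (illustrative).  Assumptions:
 * ranks are laid out node-by-node (node n owns global ranks
 * n*C .. n*C+C-1), every rank sends one int to every other rank,
 * and node-local rank 0 acts as the leader. */
#include <mpi.h>
#include <stdlib.h>

void leader_alltoall(const int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    int P, rank;
    MPI_Comm_size(comm, &P);
    MPI_Comm_rank(comm, &rank);

    /* Node-local communicator; its rank 0 is the leader. */
    MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    int C, lrank;
    MPI_Comm_size(node_comm, &C);
    MPI_Comm_rank(node_comm, &lrank);
    int N = P / C;                                 /* number of nodes */

    /* Leaders form their own communicator, ordered by global rank. */
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    int *gath = NULL, *pack = NULL, *apack = NULL, *scat = NULL;
    if (lrank == 0) {
        gath  = malloc((size_t)C * P * sizeof(int));
        pack  = malloc((size_t)C * P * sizeof(int));
        apack = malloc((size_t)C * P * sizeof(int));
        scat  = malloc((size_t)C * P * sizeof(int));
    }

    /* Step 1: every core ships its whole send buffer to the leader. */
    MPI_Gather(sendbuf, P, MPI_INT, gath, P, MPI_INT, 0, node_comm);

    if (lrank == 0) {
        /* Reorder so all data for node j (every local/remote core pair)
         * sits in one contiguous block of C*C ints. */
        for (int j = 0; j < N; j++)
            for (int c = 0; c < C; c++)
                for (int d = 0; d < C; d++)
                    pack[j*C*C + c*C + d] = gath[c*P + j*C + d];

        /* Step 2: alltoall among the leaders only. */
        MPI_Alltoall(pack, C*C, MPI_INT, apack, C*C, MPI_INT, leader_comm);

        /* Reorder so each local core d gets one contiguous slice,
         * indexed by global source rank. */
        for (int j = 0; j < N; j++)
            for (int c = 0; c < C; c++)
                for (int d = 0; d < C; d++)
                    scat[d*P + j*C + c] = apack[j*C*C + c*C + d];
    }

    /* Step 3: the leader hands each core its slice of the result. */
    MPI_Scatter(scat, P, MPI_INT, recvbuf, P, MPI_INT, 0, node_comm);

    if (lrank == 0) { free(gath); free(pack); free(apack); free(scat); }
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
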
SLIDE 19

Issues with Leader-based Scheme

  • It uses only one core to send the data out on the network
  • It does not take advantage of the increase in bandwidth obtained when more cores send data out of the node

SLIDE 20

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 21

Proposed Design

[Diagram: Node 0 and Node 1, each with multiple cores; cores in the same position on different nodes form GROUP 1 and GROUP 2; Step 1 (intra-node) and Step 2 (inter-node)]

  • All the cores take part in the communication
  • Each core communicates with one and only one core on every other node
  • Step 1: Intra-node communication
    – The data destined for other nodes is exchanged among the cores
    – The core which communicates with the respective core of the other node receives the data
  • Step 2: Inter-node communication
    – An alltoall is called within each group (a sketch of these two steps follows below)
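
Under the same illustrative assumptions as the leader-based sketch (node-by-node rank layout, C cores per node, one int per destination rank), the proposed scheme can be expressed as two alltoalls: one on a node-local communicator and one on a per-group communicator formed by the cores that share a local rank. This is our reading of the slides, not the authors' code.

/* Sketch of the proposed two-step alltoall (illustrative). */
#include <mpi.h>
#include <stdlib.h>

void twostep_alltoall(const int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    int P, rank;
    MPI_Comm_size(comm, &P);
    MPI_Comm_rank(comm, &rank);

    MPI_Comm node_comm, group_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    int C, lrank;
    MPI_Comm_size(node_comm, &C);
    MPI_Comm_rank(node_comm, &lrank);
    int N = P / C;                                /* number of nodes */

    /* One group per local rank: core l of every node joins group l. */
    MPI_Comm_split(comm, lrank, rank, &group_comm);

    int *s1 = malloc((size_t)P * sizeof(int));    /* C dests x N ints */
    int *r1 = malloc((size_t)P * sizeof(int));
    int *s2 = malloc((size_t)P * sizeof(int));    /* N dests x C ints */

    /* Step 1: intra-node exchange.  Local core d collects every block
     * destined for the core with local rank d on any node. */
    for (int d = 0; d < C; d++)
        for (int j = 0; j < N; j++)
            s1[d*N + j] = sendbuf[j*C + d];
    MPI_Alltoall(s1, N, MPI_INT, r1, N, MPI_INT, node_comm);
    /* r1[c*N + j] = data from local core c destined for (node j, core lrank) */

    /* Step 2: inter-node alltoall within the group (one core per node). */
    for (int j = 0; j < N; j++)
        for (int c = 0; c < C; c++)
            s2[j*C + c] = r1[c*N + j];
    MPI_Alltoall(s2, C, MPI_INT, recvbuf, C, MPI_INT, group_comm);
    /* recvbuf[j*C + c] = data sent to us by global rank j*C + c */

    free(s1); free(r1); free(s2);
    MPI_Comm_free(&group_comm);
    MPI_Comm_free(&node_comm);
}

Compared with the leader-based sketch, no single core carries all of a node's traffic: every NIC-bound message in Step 2 is issued by a different core.
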
SLIDE 22

Advantages of the Proposed Scheme

  • The scheme takes advantage of low-cost shared memory
  • It uses multiple cores to send data out on the network, thus achieving better network utilization
  • Each core issues the same number of sends as in the leader-based scheme, hence start-up costs are lower

SLIDE 23

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 24

Evaluation Framework

  • Testbed
    – Cluster A: 64 nodes (512 cores)
      • dual 2.33 GHz Intel Xeon “Clovertown” quad-core
      • InfiniPath SDR network interface QLE7140
      • InfiniHost III DDR network interface card MT25208
    – Cluster B: 4 nodes (32 cores)
      • dual 2.33 GHz Intel Xeon “Clovertown” quad-core
      • Mellanox DDR ConnectX network interface
  • Experiments
    – Alltoall collective time (a sketch of a typical timing harness follows below)
      • Onload InfiniPath network interface
      • Offload InfiniHost III network interface
      • ConnectX network interface
    – CPMD application performance
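
For context, a collective such as MPI_Alltoall is typically timed with a warm-up call, a barrier, and an averaged loop; the sketch below shows one common pattern, with the message size and iteration count chosen for illustration rather than taken from the evaluation.

/* Typical collective timing harness (illustrative values). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, count = 512;          /* ints per destination */
    int iters = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sbuf = calloc((size_t)size * count, sizeof(int));
    int *rbuf = calloc((size_t)size * count, sizeof(int));

    /* Warm up once, then synchronize so all ranks start together. */
    MPI_Alltoall(sbuf, count, MPI_INT, rbuf, count, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sbuf, count, MPI_INT, rbuf, count, MPI_INT, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - t0) / iters;

    /* Report the slowest rank's average time. */
    double worst;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("alltoall time: %.2f us\n", worst * 1e6);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}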

SLIDE 25

Alltoall: InfiniPath

[Figure: Alltoall time (us) vs. message size (1 byte to 2 KB) on 512 cores; curves: Original, Leader-based, Proposed]

  • The figure shows the alltoall time for different message sizes on a 512-core system
  • The leader-based scheme reduces the alltoall time
  • The proposed design gives the best performance on onload network interfaces
SLIDE 26

Alltoall-InfiniPath: 512 Bytes Message

[Figure: Alltoall time (us) vs. number of nodes (2 to 64) for 512-byte messages; curves: Original, Leader-based, Proposed]

  • The figure shows the alltoall time for 512-byte messages as the system size varies
  • The proposed scheme scales much better than the other schemes as the system size increases

SLIDE 27

Alltoall: InfiniHost III

[Figure: Alltoall time (us) vs. message size (1 byte to 2 KB) on InfiniHost III; curves: Original, Leader-based, Proposed]

  • The figure shows the performance of the schemes on offload network interfaces
  • The leader-based scheme performs best on the offload NIC, as it avoids congestion
  • This matches our expectations
SLIDE 28

Alltoall: ConnectX

[Figure: Alltoall time (us) vs. message size (1 byte to 8 KB) on ConnectX; curves: Original, Leader-based, Proposed]

  • As seen earlier, bi-directional bandwidth increases with the use of more cores on the ConnectX architecture
  • Therefore, the proposed scheme attains the best performance
SLIDE 29

CPMD Application

[Figure: CPMD execution time (sec) for the 32-wat, si63-10ryd, si63-70ryd, and si63-120ryd benchmarks; bars: Original, Leader-based, Proposed]

  • CPMD is designed for ab-initio molecular dynamics and makes extensive use of alltoall communication
  • The figure shows the performance improvement of the CPMD application on a 128-core system
  • The proposed design delivers the best execution time
SLIDE 30

CPMD Application Performance on Varying System Size

[Figure: CPMD execution time (sec) vs. system size (8x8, 16x8, 32x8, 64x8); curves: Original, Leader-based, Proposed]

  • This figure shows the application execution time on different system sizes
  • The reduction in application execution time grows with increasing system size; the proposed design scales very well
SLIDE 31

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 32

Conclusion & Future Work

  • Interfaces implemented for the same interconnect exhibit different network characteristics
  • A single collective algorithm does not perform optimally for all network interfaces
  • We have proposed an optimized alltoall collective algorithm for multi-core systems connected with modern InfiniBand network interfaces
  • The proposed design reduces MPI_Alltoall time by 55% and speeds up the CPMD application by 33%
  • We plan to evaluate our designs on 10GigE-based systems
  • We also plan to extend the study to other collectives such as broadcast and allgather

SLIDE 33


Web Pointers

http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/

SLIDE 34


Acknowledgements

Our research is supported by the following organizations

  • Current Funding support by
  • Current Equipment support by