

SLIDE 1

Scaling Alltoall Collective on Multi-core Systems

Rahul Kumar, Amith R Mamidala, Dhabaleswar K Panda

Department of Computer Science & Engineering The Ohio State University

{kumarra, mamidala, panda}@cse.ohio-state.edu

SLIDE 2

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 3

Introduction

  • Multi-core architectures are widely used for high performance computing
    – The Ranger cluster at TACC has 16 cores/node and more than 60,000 cores in total
  • Message Passing is the default programming model for distributed memory systems
  • MPI provides many communication primitives
  • MPI collective operations are widely used in applications

SLIDE 4

Introduction

  • MPI_Alltoall is the most intensive collective and is widely used in many applications such as CPMD, NAMD, FFT, and matrix transpose.
  • In MPI_Alltoall, every process has different data to send to every other process (a minimal sketch appears below).
  • An efficient alltoall is highly desirable for multi-core systems, as the number of processes has increased dramatically due to the low cost per core of multi-core architectures.
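
For reference, a minimal sketch of how an application calls MPI_Alltoall is shown below; the buffer sizes and contents are illustrative and not taken from the talk.

/* Minimal MPI_Alltoall sketch: every rank sends a distinct block of
 * COUNT ints to every other rank and receives one block from each.
 * Build with an MPI C compiler (e.g. mpicc) and launch with mpirun. */
#include <mpi.h>
#include <stdlib.h>

#define COUNT 4                     /* ints per destination (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc((size_t)size * COUNT * sizeof(int));
    int *recvbuf = malloc((size_t)size * COUNT * sizeof(int));
    for (int i = 0; i < size * COUNT; i++)
        sendbuf[i] = rank;          /* every element carries the sender's rank */

    /* Block p of sendbuf goes to rank p; block p of recvbuf comes from rank p. */
    MPI_Alltoall(sendbuf, COUNT, MPI_INT,
                 recvbuf, COUNT, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}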

SLIDE 5

Introduction

  • 24% of the Top 500 supercomputers use InfiniBand as their interconnect (based on Nov '07 rankings).
  • Several different implementations of InfiniBand network interfaces exist:
    – Offload implementation, e.g. InfiniHost III (3rd generation cards from Mellanox)
    – Onload implementation, e.g. QLogic InfiniPath
    – Combination of both onload and offload, e.g. ConnectX from Mellanox

SLIDE 6

Offload & Onload Architecture

[Diagram: offload vs. onload architectures; cores in each node connect through a NIC to the InfiniBand fabric]

  • In an offload architecture, network processing is offloaded to the network interface. The NIC can send messages on its own, relieving the CPU of communication work.
  • In an onload architecture, the CPU performs the communication in addition to the computation. The faster CPU can speed up communication, but communication cannot be overlapped with computation.

SLIDE 7

Characteristics of various Network Interfaces

  • Some basic experiments were performed on various network architectures, and the following observations were made
  • The bi-directional bandwidth of onload network interfaces increases as more cores are used to push data onto the network
  • This is shown in the following slides (a sketch of such a bandwidth microbenchmark appears below)
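
As a rough illustration of this kind of measurement, the sketch below pairs each core on one node with a core on another node and has every pair exchange messages in both directions at once. The rank placement, message size, and iteration count are our assumptions, not the exact benchmark used for these slides.

/* Multi-pair bi-directional bandwidth sketch.  Assumes ranks
 * 0..P/2-1 are placed on node A and P/2..P-1 on node B, so each
 * pair (i, i+P/2) crosses the network.  More pairs means more
 * cores pushing data through the NIC simultaneously. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE (1 << 20)          /* 1 MB per message (illustrative) */
#define ITERS    100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;

    char *sbuf = malloc(MSG_SIZE), *rbuf = malloc(MSG_SIZE);
    memset(sbuf, 1, MSG_SIZE);
    MPI_Request req[2];

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        /* Post the receive and the send together so traffic flows both ways. */
        MPI_Irecv(rbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }
    double elapsed = MPI_Wtime() - start;

    /* Each pair moves 2 * MSG_SIZE bytes per iteration. */
    double gbytes = 2.0 * MSG_SIZE * ITERS / 1e9;
    if (rank == 0)
        printf("per-pair bi-directional bandwidth: %.2f GB/s\n", gbytes / elapsed);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

Varying how many pairs per node are active is what exposes the onload/offload differences reported on the following slides.
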
SLIDE 8

Bi-directional Bandwidth: InfiniPath (onload)

  • Bi-directional bandwidth increases with more cores used to push data
  • With an onload interface, more cores help achieve better network utilization
SLIDE 9

Bi-directional Bandwidth: ConnectX

  • A similar trend is also observed for ConnectX network interfaces
SLIDE 10

Bi-directional Bandwidth: InfiniHost III (offload)

  • However, with offload network interfaces the bandwidth drops when more cores are used
  • We believe this is due to congestion at the network interface when many cores are used simultaneously

SLIDE 11

Results from the Experiments

  • Network interface characteristics differ depending on the implementation
    – QLogic onload implementations: using more cores simultaneously for inter-node communication is beneficial
    – Mellanox offload implementations: using fewer cores at the same time for inter-node communication is beneficial
    – Mellanox ConnectX architecture: using more cores simultaneously is beneficial

SLIDE 12

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 13
Motivation

  • To evaluate the performance of the existing alltoall algorithm, we conduct the following experiment
  • In the experiment, the alltoall time is measured on a set of nodes
  • The number of cores per node participating in the alltoall is increased gradually

SLIDE 14

Motivation

  • The alltoall time doubles when the number of cores per node is doubled
SLIDE 15

What is the problem with the Algorithm?

  • With one core per node, alltoall between two nodes involves one inter-node communication step per core

[Diagram: two nodes exchanging data over the network, first with one core per node and then with two cores per node]

  • With two cores per node, the number of inter-node communications performed by each core increases to two
  • So, on doubling the cores, the alltoall time is almost doubled
  • This is exactly what we obtained in the previous experiment (a simple message count below makes this explicit)
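
As a back-of-the-envelope count (our notation, assuming a direct pairwise alltoall with N nodes and C cores per node): each core must send to the C(N-1) cores on the other nodes, so it issues C(N-1) inter-node messages. Doubling C doubles every core's inter-node traffic, which matches the measured doubling of the alltoall time; for two nodes, going from one to two cores per node raises the count from 1 to 2, as stated above.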

SLIDE 16

Problem Statement

  • Can low-cost shared memory help to avoid network transactions?
  • Can the performance of alltoall be improved, especially for multi-core systems?
  • Which algorithms should be chosen for different InfiniBand implementations?

SLIDE 17

Related Work

  • There have been studies that propose a leader-based hierarchical scheme for other collectives
    – A leader is chosen on each node
    – Only the leader is involved in inter-node communication
    – The communication takes place in three stages:
  • The cores aggregate data at the leader of the node
  • The leader performs inter-node communication
  • The leader distributes the data to the cores
  • We implemented the above scheme for alltoall, as illustrated in the diagram on the next slide

SLIDE 18

Leader-based Scheme for Alltoall

[Diagram: Node 0 and Node 1, each with multiple cores and a leader, showing Step 1, Step 2, and Step 3 of the leader-based scheme]

  • Step 1: all cores send data to the leader
  • Step 2: the leader performs an alltoall with the other leaders
  • Step 3: the leader distributes the respective data to the other cores (a communicator-based sketch of these three steps follows below)
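
Below is a minimal communicator-based sketch of these three steps. It is our illustration, not the authors' implementation: the node-by-node rank layout, the one-int-per-destination payload, the use of MPI_Comm_split_type / MPI_Comm_split, and the packing loops are all assumptions made for brevity.

/* Sketch of a leader-based alltoall (illustrative).  Assumptions:
 * ranks are laid out node-by-node (node n owns global ranks
 * n*C .. n*C+C-1), every rank sends one int to every other rank,
 * and node-local rank 0 acts as the leader. */
#include <mpi.h>
#include <stdlib.h>

void leader_alltoall(const int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    int P, rank;
    MPI_Comm_size(comm, &P);
    MPI_Comm_rank(comm, &rank);

    /* Node-local communicator; its rank 0 is the leader. */
    MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    int C, lrank;
    MPI_Comm_size(node_comm, &C);
    MPI_Comm_rank(node_comm, &lrank);
    int N = P / C;                                 /* number of nodes */

    /* Leaders form their own communicator, ordered by global rank. */
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    int *gath = NULL, *pack = NULL, *apack = NULL, *scat = NULL;
    if (lrank == 0) {
        gath  = malloc((size_t)C * P * sizeof(int));
        pack  = malloc((size_t)C * P * sizeof(int));
        apack = malloc((size_t)C * P * sizeof(int));
        scat  = malloc((size_t)C * P * sizeof(int));
    }

    /* Step 1: every core ships its whole send buffer to the leader. */
    MPI_Gather(sendbuf, P, MPI_INT, gath, P, MPI_INT, 0, node_comm);

    if (lrank == 0) {
        /* Reorder so all data for node j (every local/remote core pair)
         * sits in one contiguous block of C*C ints. */
        for (int j = 0; j < N; j++)
            for (int c = 0; c < C; c++)
                for (int d = 0; d < C; d++)
                    pack[j*C*C + c*C + d] = gath[c*P + j*C + d];

        /* Step 2: alltoall among the leaders only. */
        MPI_Alltoall(pack, C*C, MPI_INT, apack, C*C, MPI_INT, leader_comm);

        /* Reorder so each local core d gets one contiguous slice,
         * indexed by global source rank. */
        for (int j = 0; j < N; j++)
            for (int c = 0; c < C; c++)
                for (int d = 0; d < C; d++)
                    scat[d*P + j*C + c] = apack[j*C*C + c*C + d];
    }

    /* Step 3: the leader hands each core its slice of the result. */
    MPI_Scatter(scat, P, MPI_INT, recvbuf, P, MPI_INT, 0, node_comm);

    if (lrank == 0) { free(gath); free(pack); free(apack); free(scat); }
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
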
SLIDE 19

Issues with Leader-based Scheme

  • It uses only one core to send the data out on the network
  • It does not take advantage of the increase in bandwidth obtained when more cores send data out of the node

SLIDE 20

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 21

Proposed Design

[Diagram: Node 0 and Node 1, each with multiple cores; cores in the same position on different nodes form GROUP 1 and GROUP 2; Step 1 (intra-node) and Step 2 (inter-node)]

  • All the cores take part in the communication
  • Each core communicates with one and only one core on every other node
  • Step 1: Intra-node communication
    – The data destined for other nodes is exchanged among the cores
    – The core which communicates with the respective core of the other node receives the data
  • Step 2: Inter-node communication
    – An alltoall is called within each group (a sketch of these two steps follows below)
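
Under the same illustrative assumptions as the leader-based sketch (node-by-node rank layout, C cores per node, one int per destination rank), the proposed scheme can be expressed as two alltoalls: one on a node-local communicator and one on a per-group communicator formed by the cores that share a local rank. This is our reading of the slides, not the authors' code.

/* Sketch of the proposed two-step alltoall (illustrative). */
#include <mpi.h>
#include <stdlib.h>

void twostep_alltoall(const int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    int P, rank;
    MPI_Comm_size(comm, &P);
    MPI_Comm_rank(comm, &rank);

    MPI_Comm node_comm, group_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    int C, lrank;
    MPI_Comm_size(node_comm, &C);
    MPI_Comm_rank(node_comm, &lrank);
    int N = P / C;                                /* number of nodes */

    /* One group per local rank: core l of every node joins group l. */
    MPI_Comm_split(comm, lrank, rank, &group_comm);

    int *s1 = malloc((size_t)P * sizeof(int));    /* C dests x N ints */
    int *r1 = malloc((size_t)P * sizeof(int));
    int *s2 = malloc((size_t)P * sizeof(int));    /* N dests x C ints */

    /* Step 1: intra-node exchange.  Local core d collects every block
     * destined for the core with local rank d on any node. */
    for (int d = 0; d < C; d++)
        for (int j = 0; j < N; j++)
            s1[d*N + j] = sendbuf[j*C + d];
    MPI_Alltoall(s1, N, MPI_INT, r1, N, MPI_INT, node_comm);
    /* r1[c*N + j] = data from local core c destined for (node j, core lrank) */

    /* Step 2: inter-node alltoall within the group (one core per node). */
    for (int j = 0; j < N; j++)
        for (int c = 0; c < C; c++)
            s2[j*C + c] = r1[c*N + j];
    MPI_Alltoall(s2, C, MPI_INT, recvbuf, C, MPI_INT, group_comm);
    /* recvbuf[j*C + c] = data sent to us by global rank j*C + c */

    free(s1); free(r1); free(s2);
    MPI_Comm_free(&group_comm);
    MPI_Comm_free(&node_comm);
}

Compared with the leader-based sketch, no single core carries all of a node's traffic: every NIC-bound message in Step 2 is issued by a different core.
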
SLIDE 22

Advantages of the Proposed Scheme

  • The scheme takes advantage of low-cost shared memory
  • It uses multiple cores to send data out on the network, thus achieving better network utilization
  • Each core issues the same number of sends as in the leader-based scheme, hence start-up costs are lower

SLIDE 23

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 24

Evaluation Framework

  • Testbed
    – Cluster A: 64 nodes (512 cores)
      • dual 2.33 GHz Intel Xeon “Clovertown” quad-core
      • InfiniPath SDR network interface QLE7140
      • InfiniHost III DDR network interface card MT25208
    – Cluster B: 4 nodes (32 cores)
      • dual 2.33 GHz Intel Xeon “Clovertown” quad-core
      • Mellanox DDR ConnectX network interface
  • Experiments
    – Alltoall collective time (a sketch of a typical timing harness follows below)
      • Onload InfiniPath network interface
      • Offload InfiniHost III network interface
      • ConnectX network interface
    – CPMD application performance
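
For context, a collective such as MPI_Alltoall is typically timed with a warm-up call, a barrier, and an averaged loop; the sketch below shows one common pattern, with the message size and iteration count chosen for illustration rather than taken from the evaluation.

/* Typical collective timing harness (illustrative values). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, count = 512;          /* ints per destination */
    int iters = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sbuf = calloc((size_t)size * count, sizeof(int));
    int *rbuf = calloc((size_t)size * count, sizeof(int));

    /* Warm up once, then synchronize so all ranks start together. */
    MPI_Alltoall(sbuf, count, MPI_INT, rbuf, count, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sbuf, count, MPI_INT, rbuf, count, MPI_INT, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - t0) / iters;

    /* Report the slowest rank's average time. */
    double worst;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("alltoall time: %.2f us\n", worst * 1e6);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}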

SLIDE 25

Alltoall: InfiniPath

[Figure: Alltoall time (us) vs. message size (1 byte to 2 KB) on 512 cores; curves: Original, Leader-based, Proposed]

  • The figure shows the alltoall time for different message sizes on a 512-core system
  • The leader-based scheme reduces the alltoall time
  • The proposed design gives the best performance on onload network interfaces
SLIDE 26

Alltoall-InfiniPath: 512 Bytes Message

[Figure: Alltoall time (us) vs. number of nodes (2 to 64) for 512-byte messages; curves: Original, Leader-based, Proposed]

  • The figure shows the alltoall time for 512-byte messages as the system size varies
  • The proposed scheme scales much better than the other schemes as the system size increases

SLIDE 27

Alltoall: InfiniHost III

[Figure: Alltoall time (us) vs. message size (1 byte to 2 KB) on InfiniHost III; curves: Original, Leader-based, Proposed]

  • The figure shows the performance of the schemes on offload network interfaces
  • The leader-based scheme performs best on the offload NIC, as it avoids congestion
  • This matches our expectations
SLIDE 28

Alltoall: ConnectX

[Figure: Alltoall time (us) vs. message size (1 byte to 8 KB) on ConnectX; curves: Original, Leader-based, Proposed]

  • As seen earlier, bi-directional bandwidth increases with the use of more cores on the ConnectX architecture
  • Therefore, the proposed scheme attains the best performance
SLIDE 29

CPMD Application

[Figure: CPMD execution time (sec) for the 32-wat, si63-10ryd, si63-70ryd, and si63-120ryd benchmarks; bars: Original, Leader-based, Proposed]

  • CPMD is designed for ab-initio molecular dynamics and makes extensive use of alltoall communication
  • The figure shows the performance improvement of the CPMD application on a 128-core system
  • The proposed design delivers the best execution time
SLIDE 30

CPMD Application Performance on Varying System Size

[Figure: CPMD execution time (sec) vs. system size (8x8, 16x8, 32x8, 64x8); curves: Original, Leader-based, Proposed]

  • This figure shows the application execution time on different system sizes
  • The reduction in application execution time grows with increasing system size; the proposed design scales very well
SLIDE 31

Presentation Outline

  • Introduction
  • Motivation & Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion & Future Work
SLIDE 32

Conclusion & Future Work

  • Interfaces implemented for the same interconnect exhibit different network characteristics
  • A single collective algorithm does not perform optimally for all network interfaces
  • We have proposed an optimized alltoall collective algorithm for multi-core systems connected with modern InfiniBand network interfaces
  • The proposed design reduces MPI_Alltoall time by 55% and speeds up the CPMD application by 33%
  • We plan to evaluate our designs on 10GigE-based systems
  • We also plan to extend the study to other collectives such as broadcast and allgather

SLIDE 33


Web Pointers

http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/

SLIDE 34


Acknowledgements

Our research is supported by the following organizations

  • Current Funding support by
  • Current Equipment support by