Design challenges of High-Performance and Scalable MPI over InfiniBand - PowerPoint PPT Presentation

SLIDE 1

Design challenges of High- performance and Scalable MPI over InfiniBand

Presented by

Karthik

SLIDE 2

Presentation Overview

  • In-depth analysis of High-Performance and Scalable MPI with Reduced Memory Usage
  • Zero-Copy Protocol using Unreliable Datagram
  • MVAPICH-Aptus : A Scalable High-Performance Multi-Transport MPI over InfiniBand

SLIDE 3

High Performance and Scalable MPI with Reduced Memory usage

Motivation

  • Does aggressively reducing communication buffer memory lead to degradation of end-application performance?
  • How much memory can we expect the MPI library to consume during execution of a typical application, while still providing the best available performance?
SLIDE 4

High Performance and Scalable MPI with Reduced Memory usage

IB provides several types of transport services –

  • Reliable Connection (RC)
  • Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
  • Most feature-rich -- supports RDMA and provides reliable service.
  • Dedicated QP must be created for each communicating peer.
  • Reliable Datagram (RD)
  • Most of the same features as RC, however, a dedicated QP is not required.
  • Not implemented with current hardware.
  • Unreliable Connection (UC)
  • Provides RDMA capability.
  • No guarantees on ordering or reliability.
  • Dedicated QP must be created for each communicating peer.
  • Unreliable Datagram (UD)
  • Connection-less. Single QP can communicate with any other peer QP.
  • Limited message size.
  • No guarantees on ordering or reliability.
SLIDE 5

High Performance and Scalable MPI with Reduced Memory usage

Upper-level software services

Shared Receive Queue

  • This allows multiple QPs to be attached to one receive queue

(even for connection oriented transport)

  • This approach is memory efficient
SLIDE 6

High Performance and Scalable MPI with Reduced Memory usage

Remote Direct Memory Access (RDMA)

  • The application can directly access the memory of the remote process.
  • RDMA has very low latency.
SLIDE 7

High Performance and Scalable MPI with Reduced Memory usage

MVAPICH Design Overview

MVAPICH uses two major protocols –

  • 1. Eager Protocol
  • It is used to transfer small messages.
  • The messages are buffered inside the MPI library.
  • “Pre-allocated” communication buffers are required on the sender and receiver sides.

  • 2. Rendezvous Protocol
  • It is used to transfer large messages.
  • The messages are sent directly to the receiver’s user memory.
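The eager/rendezvous split described above comes down to a size threshold. A minimal sketch of the selection logic — the 8 KB cutoff is illustrative, not MVAPICH's actual default:

```python
# Hypothetical eager/rendezvous selection; the cutoff is illustrative.
EAGER_THRESHOLD = 8 * 1024  # bytes

def select_protocol(message_size: int) -> str:
    """Small messages are copied through pre-allocated library buffers
    (eager); large ones go directly into the receiver's user memory
    after a handshake (rendezvous)."""
    return "eager" if message_size <= EAGER_THRESHOLD else "rendezvous"

assert select_protocol(1024) == "eager"
assert select_protocol(1 << 20) == "rendezvous"  # 1 MB message
```

The trade-off the threshold captures: eager avoids handshake latency but costs a copy and buffer memory; rendezvous avoids the copy but pays a round trip.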
SLIDE 8

High Performance and Scalable MPI with Reduced Memory usage

1 . Adaptive RDMA with Send/Receive

  • In order to avoid a memory-scalability problem as the number of nodes increases, this channel is adaptive.
  • Limited buffers are allocated initially.
  • Once a threshold number of messages have been exchanged with a peer, subsequent messages are transferred using RDMA.
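The adaptive promotion can be sketched as a per-peer message counter: RDMA resources are only committed to peers that actually communicate often, so memory does not grow with the total node count. The threshold value and names here are hypothetical:

```python
# Illustrative model of the adaptive channel: every peer starts on
# send/receive, and is promoted to RDMA only after a threshold number
# of messages. The threshold (16) is a made-up illustrative value.
RDMA_THRESHOLD = 16

class AdaptiveChannel:
    def __init__(self):
        self.msg_count = {}  # messages exchanged, per peer

    def channel_for(self, peer: int) -> str:
        n = self.msg_count.get(peer, 0) + 1
        self.msg_count[peer] = n
        # RDMA buffers are allocated only for frequently used peers.
        return "rdma" if n > RDMA_THRESHOLD else "send_recv"

ch = AdaptiveChannel()
kinds = [ch.channel_for(0) for _ in range(20)]  # peer 0 gets promoted
```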

SLIDE 9

High Performance and Scalable MPI with Reduced Memory usage

  • 2. Adaptive RDMA with SRQ Channel
  • The idea is based on ARDMA-SR; the only difference is that a Shared Receive Queue is used.
  • Drawback : The sender doesn’t know the receiver’s buffer availability.
  • Solution : Setting a “low-watermark” for the SRQ.
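The low-watermark idea can be sketched as follows: since senders cannot see how many receive buffers remain, the receiver reposts a batch whenever the posted count drops below a threshold, so the SRQ never runs dry. All constants are illustrative:

```python
# Sketch of the SRQ "low-watermark" refill; sizes are illustrative.
SRQ_SIZE = 256
LOW_WATERMARK = 32

class SharedReceiveQueue:
    def __init__(self):
        self.posted = SRQ_SIZE  # receive buffers currently posted

    def consume(self) -> None:
        """An arriving message consumes one posted buffer."""
        self.posted -= 1
        if self.posted < LOW_WATERMARK:
            # Refill the SRQ back to capacity before it empties.
            self.posted = SRQ_SIZE

srq = SharedReceiveQueue()
for _ in range(1000):
    srq.consume()
assert srq.posted >= LOW_WATERMARK  # queue never runs dry
```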
SLIDE 10

High Performance and Scalable MPI with Reduced Memory usage

  • 3. Shared Receive Queue
  • This channel exclusively utilizes the SRQ feature.
  • It follows the same “low-watermark” technique as ARDMA-SRQ.
  • Even though RDMA channels have low latency, they consume more memory.
SLIDE 11

High Performance and Scalable MPI with Reduced Memory usage

NAS Benchmark

SLIDE 12

High Performance and Scalable MPI with Reduced Memory usage

High Performance Linpack

  • Benchmark for solving a dense system of linear equations.
  • It is used as the primary measure for ranking the biannual Top 500 list of the world’s fastest supercomputers.

SLIDE 13

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

SLIDE 14

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Motivation

  • 1. Performance Scalability
  • Memory copies are detrimental to the overall performance of the application.
  • The HCA cache can only hold a limited number of QPs.
  • 2. Resource Scalability
  • With a connection-oriented transport, the memory requirements increase linearly with the number of connected processes.

SLIDE 15

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Traditional Zero-Copy

  • 1. Matched Queues Interface
  • The receiver deciphers the message tag from the sent message and matches it against the posted receive operations.
  • 2. Rendezvous Protocol using RDMA
  • Initially a handshake protocol is used, followed by RDMA.
SLIDE 16

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

UD vs RC memory usage

For 16K connections: UD = 40 MB / process, RC = 240 MB / process
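A back-of-the-envelope check of the RC figure above: with a connection-oriented transport, each of the 16K peers needs a dedicated QP, so memory grows linearly with peer count, while UD uses a single QP regardless. The per-QP cost here is not a measured value; it is back-solved to reproduce the slide's number:

```python
# Linear RC memory growth: peers x per-QP cost. The ~15 KB/QP figure
# is an assumption chosen to match the slide's 240 MB total.
PEERS = 16 * 1024
RC_BYTES_PER_QP = 15 * 1024

rc_total_mb = PEERS * RC_BYTES_PER_QP / (1024 * 1024)
assert rc_total_mb == 240.0  # reproduces the RC figure above
```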

SLIDE 17

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Challenges for true zero copy design

  • Limited MTU Size
  • The UD transport has a Maximum Transfer Unit (MTU) limit of 2KB.
  • Segmentation is required.
  • Lack of Dedicated Receive Buffers
  • Difficult to post receive buffers for a particular peer, as they are all shared.
  • If no buffer is posted to a QP, a sent message is silently dropped.
  • Lack of Reliability
  • There is no guarantee that a message will arrive at the receiver.
  • Lack of Ordering
  • Messages may not arrive in the same order they are sent.
  • Lack of RDMA
  • RDMA only works for connection-oriented transports.
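The first challenge above — segmentation — amounts to slicing a large message into MTU-sized packets, each posted as a separate UD send. A minimal sketch, using the 2 KB MTU from the slide:

```python
# Split a message into UD-MTU-sized packets (2 KB per the slide).
MTU = 2 * 1024

def segment(message: bytes) -> list[bytes]:
    return [message[i:i + MTU] for i in range(0, len(message), MTU)]

packets = segment(b"x" * (1 << 20))  # a 1 MB message -> 512 packets
assert len(packets) == 512
assert all(len(p) <= MTU for p in packets)
```

The remaining challenges (loss, reordering) then apply per packet, which is why the reliability layer on the next slides tracks individual segments.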
SLIDE 18

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Proposed Design

  • The design is based on serialized communication, since RDMA is not specified for the UD transport.
  • Serialized implies that the order of transfer is agreed beforehand, and only one sender transmits to a given QP at a time.

SLIDE 19

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Solutions to design challenges

  • 1. Efficient Segmentation
  • The design chooses to request a completion signal only for the last packet.
  • The underlying reliability layer marks packets as missing at the receiver’s end, and the sender is notified.
  • 2. Zero-Copy Pool
  • A pool of QPs is maintained.
  • When a message transfer is initiated, a QP is taken from the pool and the application receive buffer is posted to it.
  • 3. Optimized Reliability and Ordering for Large Messages
  • One approach is to perform a checksum over the entire receive buffer.
  • Each operation can specify a 32-bit immediate field that will be available to the receiver as part of the completion entry.
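The zero-copy pool (point 2 above) can be modeled as a check-out/check-in scheme: a dedicated QP is borrowed for the duration of a transfer, and the application's own receive buffer is posted to it, so incoming packets land in user memory with no intermediate copy. Class and method names here are hypothetical:

```python
# Illustrative model of the zero-copy QP pool; names are hypothetical.
class ZeroCopyPool:
    def __init__(self, size: int):
        self.free = list(range(size))  # available QP identifiers
        self.posted = {}               # qp -> posted user buffer

    def begin_transfer(self, recv_buffer: bytearray) -> int:
        qp = self.free.pop()           # borrow a QP from the pool
        self.posted[qp] = recv_buffer  # post the user buffer directly
        return qp

    def end_transfer(self, qp: int) -> None:
        del self.posted[qp]
        self.free.append(qp)           # return the QP for reuse

pool = ZeroCopyPool(size=8)
qp = pool.begin_transfer(bytearray(1 << 20))  # start a 1 MB receive
```

Serialization (previous slide) is what makes this safe: since only one sender targets a given QP at a time, the posted buffer is unambiguous.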

SLIDE 20

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Experimental Evaluation

Ping Pong Latency

SLIDE 21

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Uni-Directional Bandwidth

SLIDE 22

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Bi-Directional Bandwidth

SLIDE 23

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

SLIDE 24

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Motivation

This paper seeks to address two main questions:

  • 1. What are the different protocols developed for MPI over IB, and how well do they perform at scale?
  • 2. Given this knowledge, can the MPI library be designed to dynamically select protocols to optimize for performance and scalability?

SLIDE 25

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

IB provides several types of transport services –

  • Reliable Connection (RC)
  • Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
  • Most feature-rich -- supports RDMA and provides reliable service.
  • Dedicated QP must be created for each communicating peer.
  • Reliable Datagram (RD)
  • Most of the same features as RC, however, a dedicated QP is not required.
  • Not implemented with current hardware.
  • Unreliable Connection (UC)
  • Provides RDMA capability.
  • No guarantees on ordering or reliability.
  • Dedicated QP must be created for each communicating peer.
  • Unreliable Datagram (UD)
  • Connection-less. Single QP can communicate with any other peer QP.
  • Limited message size.
  • No guarantees on ordering or reliability.
SLIDE 26

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Message Channel

Eager Protocol Channel

SLIDE 27

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Message Channel

Rendezvous Protocol Channel

SLIDE 28

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Performance : Eager Latency

SLIDE 29

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Performance : Uni-Directional Bandwidth

SLIDE 30

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Scalability Test : Memory Usage

SLIDE 31

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Scalability Test : Latency

SLIDE 32

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Characteristics Summary

SLIDE 33

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Overview of Design

  • As seen from the experimental results, using only one channel is not sufficient to achieve both performance and scalability.
  • The solution is to use a combination of message channels and transports to optimize for performance as well as scalability.

Design Challenges

  • 1. When should a channel be created?
  • 2. When should a channel be used?
SLIDE 34

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Allocation

SLIDE 35

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Usage

  • From the experimental results we can see that the channels behave differently for different message sizes.
  • A flexible framework is defined for selecting a channel when sending a message.
  • Using this flexible framework, send rules can be changed on a per-system or per-job level to meet application needs without changing the code within the MPI library.
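The send-rule idea can be sketched as a small rule table mapping message-size ranges to channels, evaluated at send time; policy changes then mean editing the table, not the library. Channel names and cutoffs below are illustrative, not Aptus's actual defaults:

```python
# Hypothetical per-job send-rule table; names/cutoffs are illustrative.
SEND_RULES = [
    (2 * 1024,  "ud-eager"),       # messages up to 2 KB
    (16 * 1024, "rc-eager"),       # messages up to 16 KB
    (None,      "rc-rendezvous"),  # everything larger
]

def select_channel(size: int) -> str:
    """Return the first channel whose size limit admits the message."""
    for limit, channel in SEND_RULES:
        if limit is None or size <= limit:
            return channel

assert select_channel(1024) == "ud-eager"
assert select_channel(1 << 20) == "rc-rendezvous"
```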

SLIDE 36

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Performance Evaluation

SLIDE 37

QUESTIONS ?