Design challenges of High-Performance and Scalable MPI over InfiniBand - PowerPoint PPT Presentation

SLIDE 1

Design challenges of High- performance and Scalable MPI over InfiniBand

Presented by

Karthik

SLIDE 2

Presentation Overview

  • In-depth analysis of High-Performance and Scalable MPI with Reduced Memory Usage
  • Zero-Copy Protocol using Unreliable Datagram
  • MVAPICH-Aptus : A Scalable High-Performance Multi-Transport MPI over InfiniBand

SLIDE 3

High Performance and Scalable MPI with Reduced Memory usage

Motivation

  • Does aggressively reducing communication buffer memory lead to degradation of end-application performance?
  • How much memory can we expect the MPI library to consume during execution of a typical application, while still providing the best available performance?
SLIDE 4

High Performance and Scalable MPI with Reduced Memory usage

IB provides several types of transport services –

  • Reliable Connection (RC)
  • Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
  • Most feature-rich -- supports RDMA and provides reliable service.
  • Dedicated QP must be created for each communicating peer.
  • Reliable Datagram (RD)
  • Most of the same features as RC, however, a dedicated QP is not required.
  • Not implemented with current hardware.
  • Unreliable Connection (UC)
  • Provides RDMA capability.
  • No guarantees on ordering or reliability.
  • Dedicated QP must be created for each communicating peer.
  • Unreliable Datagram (UD)
  • Connection-less. Single QP can communicate with any other peer QP.
  • Limited message size.
  • No guarantees on ordering or reliability.
SLIDE 5

High Performance and Scalable MPI with Reduced Memory usage

Upper-level software services

Shared Receive Queue

  • This allows multiple QPs to be attached to one receive queue

(even for connection oriented transport)

  • This approach is memory efficient
SLIDE 6

High Performance and Scalable MPI with Reduced Memory usage

Remote Direct Memory Access (RDMA)

  • The application can directly access the memory of the remote process.
  • RDMA has very low latency.
SLIDE 7

High Performance and Scalable MPI with Reduced Memory usage

MVAPICH Design Overview

MVAPICH uses two major protocols –

  • 1. Eager Protocol
  • It is used to transfer small messages.
  • The messages are buffered inside the MPI library.
  • “Pre-allocated” communication buffers are required on the sender and receiver sides.

  • 2. Rendezvous Protocol
  • It is used to transfer large messages.
  • The messages are sent directly to the receiver’s user memory.
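The eager/rendezvous split described above comes down to a size threshold. A minimal sketch of the selection logic — the 8 KB cutoff is illustrative, not MVAPICH's actual default:

```python
# Hypothetical eager/rendezvous selection; the cutoff is illustrative.
EAGER_THRESHOLD = 8 * 1024  # bytes

def select_protocol(message_size: int) -> str:
    """Small messages are copied through pre-allocated library buffers
    (eager); large ones go directly into the receiver's user memory
    after a handshake (rendezvous)."""
    return "eager" if message_size <= EAGER_THRESHOLD else "rendezvous"

assert select_protocol(1024) == "eager"
assert select_protocol(1 << 20) == "rendezvous"  # 1 MB message
```

The trade-off the threshold captures: eager avoids handshake latency but costs a copy and buffer memory; rendezvous avoids the copy but pays a round trip.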
SLIDE 8

High Performance and Scalable MPI with Reduced Memory usage

1 . Adaptive RDMA with Send/Receive

  • In order to avoid a memory-scalability problem as the number of nodes increases, this channel is adaptive.
  • Limited buffers are allocated initially.
  • Once a threshold number of messages have been exchanged with a peer, subsequent messages are transferred using RDMA.
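The adaptive promotion can be sketched as a per-peer message counter: RDMA resources are only committed to peers that actually communicate often, so memory does not grow with the total node count. The threshold value and names here are hypothetical:

```python
# Illustrative model of the adaptive channel: every peer starts on
# send/receive, and is promoted to RDMA only after a threshold number
# of messages. The threshold (16) is a made-up illustrative value.
RDMA_THRESHOLD = 16

class AdaptiveChannel:
    def __init__(self):
        self.msg_count = {}  # messages exchanged, per peer

    def channel_for(self, peer: int) -> str:
        n = self.msg_count.get(peer, 0) + 1
        self.msg_count[peer] = n
        # RDMA buffers are allocated only for frequently used peers.
        return "rdma" if n > RDMA_THRESHOLD else "send_recv"

ch = AdaptiveChannel()
kinds = [ch.channel_for(0) for _ in range(20)]  # peer 0 gets promoted
```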

SLIDE 9

High Performance and Scalable MPI with Reduced Memory usage

  • 2. Adaptive RDMA with SRQ Channel
  • The idea is based on ARDMA-SR; the only difference is that a Shared Receive Queue is used.
  • Drawback : The sender doesn’t know the receiver’s buffer availability.
  • Solution : Setting a “low-watermark” for the SRQ.
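The low-watermark idea can be sketched as follows: since senders cannot see how many receive buffers remain, the receiver reposts a batch whenever the posted count drops below a threshold, so the SRQ never runs dry. All constants are illustrative:

```python
# Sketch of the SRQ "low-watermark" refill; sizes are illustrative.
SRQ_SIZE = 256
LOW_WATERMARK = 32

class SharedReceiveQueue:
    def __init__(self):
        self.posted = SRQ_SIZE  # receive buffers currently posted

    def consume(self) -> None:
        """An arriving message consumes one posted buffer."""
        self.posted -= 1
        if self.posted < LOW_WATERMARK:
            # Refill the SRQ back to capacity before it empties.
            self.posted = SRQ_SIZE

srq = SharedReceiveQueue()
for _ in range(1000):
    srq.consume()
assert srq.posted >= LOW_WATERMARK  # queue never runs dry
```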
SLIDE 10

High Performance and Scalable MPI with Reduced Memory usage

  • 3. Shared Receive Queue
  • This channel exclusively utilizes the SRQ feature.
  • It follows the same “low-watermark” technique as ARDMA-SRQ.
  • Even though RDMA channels have low latency, they consume more memory.
SLIDE 11

High Performance and Scalable MPI with Reduced Memory usage

NAS Benchmark

SLIDE 12

High Performance and Scalable MPI with Reduced Memory usage

High Performance Linpack

  • Benchmark for solving a dense system of linear equations.
  • It is used as the primary measure for ranking the biannual Top 500 list of the world’s fastest supercomputers.

SLIDE 13

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

SLIDE 14

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Motivation

  • 1. Performance Scalability
  • Memory copies are detrimental to the overall performance of the application.
  • The HCA cache can only hold a limited number of QPs.
  • 2. Resource Scalability
  • With a connection-oriented transport, the memory requirements increase linearly with the number of connected processes.

SLIDE 15

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Traditional Zero-Copy

  • 1. Matched Queues Interface
  • The receiver deciphers the message tag from the sent message and matches it against the posted receive operations.
  • 2. Rendezvous Protocol using RDMA
  • Initially a handshake protocol is used, followed by RDMA.
SLIDE 16

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

UD vs RC memory usage

For 16K connections: UD = 40 MB / process, RC = 240 MB / process
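A back-of-the-envelope check of the RC figure above: with a connection-oriented transport, each of the 16K peers needs a dedicated QP, so memory grows linearly with peer count, while UD uses a single QP regardless. The per-QP cost here is not a measured value; it is back-solved to reproduce the slide's number:

```python
# Linear RC memory growth: peers x per-QP cost. The ~15 KB/QP figure
# is an assumption chosen to match the slide's 240 MB total.
PEERS = 16 * 1024
RC_BYTES_PER_QP = 15 * 1024

rc_total_mb = PEERS * RC_BYTES_PER_QP / (1024 * 1024)
assert rc_total_mb == 240.0  # reproduces the RC figure above
```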

SLIDE 17

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Challenges for true zero copy design

  • Limited MTU Size
  • The UD transport has a Maximum Transfer Unit (MTU) limit of 2KB.
  • Segmentation is required.
  • Lack of Dedicated Receive Buffers
  • Difficult to post receive buffers for a particular peer, as they are all shared.
  • If no buffer is posted to a QP, a sent message is silently dropped.
  • Lack of Reliability
  • There is no guarantee that a message will arrive at the receiver.
  • Lack of Ordering
  • Messages may not arrive in the same order they are sent.
  • Lack of RDMA
  • RDMA only works for connection-oriented transports.
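The first challenge above — segmentation — amounts to slicing a large message into MTU-sized packets, each posted as a separate UD send. A minimal sketch, using the 2 KB MTU from the slide:

```python
# Split a message into UD-MTU-sized packets (2 KB per the slide).
MTU = 2 * 1024

def segment(message: bytes) -> list[bytes]:
    return [message[i:i + MTU] for i in range(0, len(message), MTU)]

packets = segment(b"x" * (1 << 20))  # a 1 MB message -> 512 packets
assert len(packets) == 512
assert all(len(p) <= MTU for p in packets)
```

The remaining challenges (loss, reordering) then apply per packet, which is why the reliability layer on the next slides tracks individual segments.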
SLIDE 18

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Proposed Design

  • The design is based on serialized communication, since RDMA is not specified for the UD transport.
  • Serialized implies that the order of transfer is agreed beforehand, and only one sender transmits to a given QP at a time.

SLIDE 19

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Solutions to design challenges

  • 1. Efficient Segmentation
  • The design chooses to request a completion signal only for the last packet.
  • The underlying reliability layer marks packets as missing at the receiver’s end, and the sender is notified.
  • 2. Zero-Copy Pool
  • A pool of QPs is maintained.
  • When a message transfer is initiated, a QP is taken from the pool and the application receive buffer is posted to it.
  • 3. Optimized Reliability and Ordering for Large Messages
  • One approach is to perform a checksum over the entire receive buffer.
  • Each operation can specify a 32-bit immediate field that will be available to the receiver as part of the completion entry.
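The zero-copy pool (point 2 above) can be modeled as a check-out/check-in scheme: a dedicated QP is borrowed for the duration of a transfer, and the application's own receive buffer is posted to it, so incoming packets land in user memory with no intermediate copy. Class and method names here are hypothetical:

```python
# Illustrative model of the zero-copy QP pool; names are hypothetical.
class ZeroCopyPool:
    def __init__(self, size: int):
        self.free = list(range(size))  # available QP identifiers
        self.posted = {}               # qp -> posted user buffer

    def begin_transfer(self, recv_buffer: bytearray) -> int:
        qp = self.free.pop()           # borrow a QP from the pool
        self.posted[qp] = recv_buffer  # post the user buffer directly
        return qp

    def end_transfer(self, qp: int) -> None:
        del self.posted[qp]
        self.free.append(qp)           # return the QP for reuse

pool = ZeroCopyPool(size=8)
qp = pool.begin_transfer(bytearray(1 << 20))  # start a 1 MB receive
```

Serialization (previous slide) is what makes this safe: since only one sender targets a given QP at a time, the posted buffer is unambiguous.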

SLIDE 20

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Experimental Evaluation

Ping Pong Latency

SLIDE 21

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Uni-Directional Bandwidth

SLIDE 22

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Bi-Directional Bandwidth

SLIDE 23

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

SLIDE 24

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Motivation

This paper seeks to address two main questions:

  • 1. What are the different protocols developed for MPI over IB, and how well do they perform at scale?
  • 2. Given this knowledge, can the MPI library be designed to dynamically select protocols to optimize for performance and scalability?

SLIDE 25

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

IB provides several types of transport services –

  • Reliable Connection (RC)
  • Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
  • Most feature-rich -- supports RDMA and provides reliable service.
  • Dedicated QP must be created for each communicating peer.
  • Reliable Datagram (RD)
  • Most of the same features as RC, however, a dedicated QP is not required.
  • Not implemented with current hardware.
  • Unreliable Connection (UC)
  • Provides RDMA capability.
  • No guarantees on ordering or reliability.
  • Dedicated QP must be created for each communicating peer.
  • Unreliable Datagram (UD)
  • Connection-less. Single QP can communicate with any other peer QP.
  • Limited message size.
  • No guarantees on ordering or reliability.
SLIDE 26

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Message Channel

Eager Protocol Channel

SLIDE 27

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Message Channel

Rendezvous Protocol Channel

SLIDE 28

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Performance : Eager Latency

SLIDE 29

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Performance : Uni-Directional Bandwidth

SLIDE 30

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Scalability Test : Memory Usage

SLIDE 31

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Scalability Test : Latency

SLIDE 32

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Characteristics Summary

SLIDE 33

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Overview of Design

  • As seen from the experimental results, using only one channel is not sufficient to achieve both performance and scalability.
  • The solution is to use a combination of message channels and transports to optimize for performance as well as scalability.

Design Challenges

  • 1. When should a channel be created?
  • 2. When should a channel be used?
SLIDE 34

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Allocation

SLIDE 35

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Usage

  • From the experimental results we can see that the channels behave differently for different message sizes.
  • A flexible framework is defined for selecting a channel when sending a message.
  • Using this flexible framework, send rules can be changed on a per-system or per-job level to meet application needs without changing the code within the MPI library.
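The send-rule idea can be sketched as a small rule table mapping message-size ranges to channels, evaluated at send time; policy changes then mean editing the table, not the library. Channel names and cutoffs below are illustrative, not Aptus's actual defaults:

```python
# Hypothetical per-job send-rule table; names/cutoffs are illustrative.
SEND_RULES = [
    (2 * 1024,  "ud-eager"),       # messages up to 2 KB
    (16 * 1024, "rc-eager"),       # messages up to 16 KB
    (None,      "rc-rendezvous"),  # everything larger
]

def select_channel(size: int) -> str:
    """Return the first channel whose size limit admits the message."""
    for limit, channel in SEND_RULES:
        if limit is None or size <= limit:
            return channel

assert select_channel(1024) == "ud-eager"
assert select_channel(1 << 20) == "rc-rendezvous"
```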

SLIDE 36

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Performance Evaluation

SLIDE 37

QUESTIONS ?