
SLIDE 1: Support for GPUs with GPUDirect RDMA in MVAPICH2

SC'13 NVIDIA Booth

by D.K. Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

SLIDE 2: Outline

  • Overview of MVAPICH2-GPU Project
  • GPUDirect RDMA with Mellanox IB adapters
  • Other Optimizations for GPU Communication
  • Support for MPI + OpenACC
  • CUDA and OpenACC extensions in OMB

SLIDE 3: Drivers of Modern HPC Cluster Architectures

  • Multi-core processors are ubiquitous and InfiniBand is widely accepted
  • Accelerators/coprocessors are becoming common in high-end systems
  • MVAPICH2 has constantly evolved to provide superior performance
  • How does MVAPICH2 help development on these emerging architectures?

(Figure: the building blocks of modern clusters — multi-core processors; high-performance interconnects such as InfiniBand, with <1 usec latency and >100 Gbps bandwidth; and accelerators/coprocessors, with high compute density, high performance/watt, and >1 TFlop double precision on a chip)

Example systems and their Top500 ranks: Tianhe-2 (1), Titan (2), Stampede (6), Tianhe-1A (10)

SLIDE 4: InfiniBand + GPU Systems (Past)

  • Many systems today have both GPUs and high-speed networks such as InfiniBand
  • Problem: lack of a common memory registration mechanism
    – Each device has to pin the host memory it will use
    – Many operating systems do not allow multiple devices to register the same memory pages
  • Previous solution:
    – Use a different buffer for each device and copy data between them

SLIDE 5: GPU-Direct

  • Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
  • Both devices register a common host buffer
    – The GPU copies data to this buffer, and the network adapter can read directly from it (or vice versa)
  • Note that GPU-Direct does not allow you to bypass host memory

SLIDE 6: MPI + CUDA

(Figure: GPU and NIC attached to the CPU through a PCIe switch)

  • Data movement in applications with standard MPI and CUDA interfaces:

      At Sender:
        cudaMemcpy(s_hostbuf, s_devbuf, ...);
        MPI_Send(s_hostbuf, size, ...);

      At Receiver:
        MPI_Recv(r_hostbuf, size, ...);
        cudaMemcpy(r_devbuf, r_hostbuf, ...);

    High Productivity and Low Performance

  • Alternatively, users can do the pipelining at the application level using non-blocking MPI and CUDA interfaces (a sketch follows below)

    Low Productivity and High Performance
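
To make the trade-off concrete, here is a minimal sketch of what sender-side pipelining looks like when done by hand with CUDA streams and non-blocking MPI. The function name pipelined_send, the chunk size, the pinned staging buffer, and the lack of error checking are illustrative assumptions, not MVAPICH2 internals.

    /* Sketch only: application-level pipelining of a GPU-resident send buffer.
     * Chunk size, stream usage and error handling are simplifying assumptions. */
    #include <stdlib.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK (256 * 1024)   /* assumed pipeline unit, in bytes */

    void pipelined_send(const char *s_devbuf, size_t nbytes, int dest, int tag)
    {
        char *s_hostbuf;
        cudaMallocHost((void **)&s_hostbuf, nbytes);   /* pinned staging buffer */
        int nchunks = (int)(nbytes / CHUNK);           /* assume CHUNK divides nbytes */
        MPI_Request *reqs = malloc(nchunks * sizeof(MPI_Request));
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        for (int i = 0; i < nchunks; i++) {
            /* copy chunk i to the host asynchronously, then wait for it alone */
            cudaMemcpyAsync(s_hostbuf + (size_t)i * CHUNK,
                            s_devbuf + (size_t)i * CHUNK,
                            CHUNK, cudaMemcpyDeviceToHost, stream);
            cudaStreamSynchronize(stream);
            /* the non-blocking send of chunk i overlaps with the copy of chunk i+1 */
            MPI_Isend(s_hostbuf + (size_t)i * CHUNK, CHUNK, MPI_BYTE,
                      dest, tag, MPI_COMM_WORLD, &reqs[i]);
        }
        MPI_Waitall(nchunks, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
        cudaStreamDestroy(stream);
        cudaFreeHost(s_hostbuf);
    }

Even in this simplified form, the application owns the staging buffer, the chunking and the completion logic; the GPU-aware design on the next slide moves all of this inside the MPI library.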

SLIDE 7: GPU-Aware MPI Library: MVAPICH2-GPU

      At Sender:
        MPI_Send(s_devbuf, size, ...);

      At Receiver:
        MPI_Recv(r_devbuf, size, ...);

    (Pipelining is handled inside MVAPICH2)

  • Standard MPI interfaces used for unified data movement
  • Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
  • Optimizes data movement from GPU memory

    High Performance and High Productivity
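
For completeness, a minimal self-contained sketch of the same exchange is shown below. The message size, the ranks and the use of MPI_BYTE are illustrative assumptions, and the program requires a CUDA-aware MPI build such as MVAPICH2-GPU; the point is simply that cudaMalloc'd pointers are passed straight to MPI_Send/MPI_Recv.

    /* Sketch: passing device pointers directly to a GPU-aware MPI library.
     * Sizes and ranks are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t nbytes = 1 << 20;          /* 1 MB message, just an example */
        char *devbuf;
        cudaMalloc((void **)&devbuf, nbytes);

        if (rank == 0) {
            cudaMemset(devbuf, 1, nbytes);                 /* fill on the device */
            MPI_Send(devbuf, (int)nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(devbuf, (int)nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                   /* lands in GPU memory */
        }

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }

Apart from the cudaMalloc, this is the same source code as a host-buffer version; the staging and pipelining of the previous slide happen inside the library.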

SLIDE 8: Pipelined Data Movement in MVAPICH2

  • Pipelines data movement from the GPU and overlaps:
    – device-to-host CUDA copies
    – inter-process data movement (network transfers or shared-memory copies)
    – host-to-device CUDA copies
  • 45% improvement compared with a naive implementation (Memcpy+Send)
  • 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend)

(Figure: internode osu_latency for large messages, 32K–2M bytes; latency in us for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; lower is better)

SLIDE 9: Outline

  • Overview of MVAPICH2-GPU Project
  • GPUDirect RDMA with Mellanox IB adapters
  • Other Optimizations for GPU Communication
  • Support for MPI + OpenACC
  • CUDA and OpenACC extensions in OMB

SLIDE 10: GPU-Direct RDMA (GDR) with CUDA

  • The network adapter can directly read/write data from/to GPU device memory
  • Avoids copies through host memory
  • Fastest possible communication between the GPU and the IB HCA
  • Allows for better asynchronous communication
  • OFED with GDR support is under development by Mellanox and NVIDIA

(Figure: data path from the InfiniBand adapter directly to GPU memory, without staging through the CPU and system memory)

SLIDE 11: GPU-Direct RDMA (GDR) with CUDA

  • OFED with support for GPUDirect RDMA is under work by NVIDIA and Mellanox
  • OSU has an initial design of MVAPICH2 using GPUDirect RDMA
    – Hybrid design using GPUDirect RDMA and host-based pipelining
      • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
    – Support for communication using multi-rail
    – Support for Mellanox Connect-IB and ConnectX VPI adapters
    – Support for RoCE with Mellanox ConnectX VPI adapters

(Figure: IB adapter, GPU and CPU attached through the chipset, with system memory and GPU memory)

    PCIe P2P characteristics:
    – SandyBridge (E5-2670): P2P write 5.2 GB/s, P2P read < 1.0 GB/s
    – IvyBridge (E5-2680 v2): P2P write 6.4 GB/s, P2P read 3.5 GB/s

SLIDE 12: Performance of MVAPICH2 with GPU-Direct RDMA: Latency

GPU-GPU internode MPI latency

(Figure: small-message latency, 1 byte–4K, and large-message latency, 8K–2M; curves for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; lower is better)

  • Small messages: 67% improvement with GDR, down to 5.49 usec
  • Large messages: about 10% improvement

Setup: based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPU-Direct RDMA patch

SLIDE 13: Performance of MVAPICH2 with GPU-Direct RDMA: Bandwidth

GPU-GPU internode MPI uni-directional bandwidth

(Figure: small-message bandwidth, 1 byte–4K, and large-message bandwidth, 8K–2M; curves for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; higher is better)

  • Small messages: about 5x higher bandwidth with GDR
  • Large messages: up to 9.8 GB/s

Setup: based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPU-Direct RDMA patch

SLIDE 14: Performance of MVAPICH2 with GPU-Direct RDMA: Bi-Bandwidth

GPU-GPU internode MPI bi-directional bandwidth

(Figure: small-message bi-bandwidth, 1 byte–4K, and large-message bi-bandwidth, 8K–2M; curves for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; higher is better)

  • Small messages: about 4.3x higher bi-directional bandwidth with GDR
  • Large messages: 19% improvement, up to 19 GB/s

Setup: based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPU-Direct RDMA patch

SLIDE 15: How Can I Get Started with GDR Experimentation?

  • Two modules are needed
    – Alpha version of the OFED kernel and libraries with GPUDirect RDMA (GDR) support from Mellanox
    – Alpha version of MVAPICH2-GDR from OSU (currently a separate distribution)
  • Send a note to hpc@mellanox.com
  • You will get alpha versions of the GDR driver and MVAPICH2-GDR (based on the MVAPICH2 2.0a release)
  • You can get started with this version
  • The MVAPICH2 team is working on multiple enhancements (collectives, datatypes, one-sided) to exploit the advantages of GDR
  • As the GDR driver matures, successive versions of MVAPICH2-GDR with enhancements will be made available to the community

SLIDE 16: Outline

  • Overview of MVAPICH2-GPU Project
  • GPUDirect RDMA with Mellanox IB adapters
  • Other Optimizations for GPU Communication
  • Support for MPI + OpenACC
  • CUDA and OpenACC extensions in OMB

SLIDE 17: Multi-GPU Configurations

(Figure: single node with CPU, memory, I/O hub and HCA, and two GPUs, GPU 0 and GPU 1, driven by Process 0 and Process 1)

  • Multi-GPU node architectures are becoming common
  • Until CUDA 3.2
    – Communication between processes staged through the host
    – Shared memory (pipelined)
    – Network loopback (asynchronous)
  • CUDA 4.0 and later
    – Inter-Process Communication (IPC)
    – Host bypass, handled by a DMA engine
    – Low latency and asynchronous
    – Requires creation, exchange and mapping of memory handles, which adds overhead (see the sketch below)
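
To make the handle overhead concrete, here is a rough sketch of the raw CUDA IPC steps an application (or, in this case, the MPI library on its behalf) has to perform. Exchanging the handle over MPI, the function name ipc_example, and the omission of error checking are illustrative assumptions.

    /* Sketch only: exchanging a CUDA IPC handle between two processes on the
     * same node. Error handling is omitted; tags and ranks are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void ipc_example(int rank, size_t nbytes)
    {
        if (rank == 0) {
            /* exporting process: allocate device memory and create a handle */
            void *devbuf;
            cudaMalloc(&devbuf, nbytes);
            cudaIpcMemHandle_t handle;
            cudaIpcGetMemHandle(&handle, devbuf);
            /* hand the opaque handle to the peer (here via MPI; any IPC works) */
            MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* importing process: map the peer's allocation into this process */
            cudaIpcMemHandle_t handle;
            MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            void *peerbuf;
            cudaIpcOpenMemHandle(&peerbuf, handle, cudaIpcMemLazyEnablePeerAccess);
            /* peerbuf can now be used as the source/target of cudaMemcpy calls */
            cudaIpcCloseMemHandle(peerbuf);
        }
    }

The next slide shows how MVAPICH2 performs and hides these steps internally, so applications only see regular MPI calls.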

SLIDE 18: Designs in MVAPICH2 and Performance

  • MVAPICH2 takes advantage of CUDA IPC for MPI communication between GPUs on the same node
  • Hides the complexity and overheads of handle creation, exchange and mapping
  • Available in standard releases since MVAPICH2 1.8

(Figures: intranode osu_latency for small and large messages and intranode osu_bw, comparing the SHARED-MEM and CUDA IPC designs; roughly 70% and 46% lower latency and 78% higher bandwidth with CUDA IPC)

SLIDE 19: Collectives Optimizations in MVAPICH2: Overview

  • Optimizes data movement at the collective level for small messages
  • Pipelines data movement in each send/recv operation for large messages
  • Several collectives have been optimized: Bcast, Gather, Scatter, Allgather, Alltoall, Scatterv, Gatherv, Allgatherv, Alltoallv
  • Collective-level optimizations are completely transparent to the user
  • Pipelining can be tuned using point-to-point parameters
SLIDE 20: MPI Datatype Support in MVAPICH2

  • Non-contiguous data exchange
  • Multi-dimensional data
    – Row-based organization
    – Contiguous in one dimension
    – Non-contiguous in the other dimensions
  • Halo data exchange
    – Duplicate the boundary
    – Exchange the boundary in each iteration

(Figure: halo data exchange between neighboring subdomains)
SLIDE 21: MPI Datatype Support in MVAPICH2

  • Datatype support in MPI
    – Operate on user-defined datatypes to improve productivity
    – Enables the MPI library to optimize non-contiguous data movement

      At Sender:
        MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
        MPI_Type_commit(&new_type);
        ...
        MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

  • Inside MVAPICH2
    – Uses datatype-specific CUDA kernels to pack data in chunks
    – Optimized vector datatypes: kernel-based pack/unpack in MVAPICH2 2.0b
    – Efficiently moves data between nodes using RDMA
    – Transparent to the user

  • H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, "GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation", IEEE Transactions on Parallel and Distributed Systems, accepted for publication.
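
As a concrete instance of the pattern above: to send one column of a row-major N x N array of doubles that lives in GPU memory, the vector type would be built as sketched below. The grid size, variable names and peer rank are illustrative assumptions.

    /* Sketch: sending one column of a row-major N x N double array that
     * resides in GPU memory. Names, size and peer rank are illustrative. */
    int N = 512;                       /* grid dimension (assumed) */
    double *d_grid;                    /* device allocation of N*N doubles */
    cudaMalloc((void **)&d_grid, (size_t)N * N * sizeof(double));

    MPI_Datatype column;
    /* N blocks of 1 element each, separated by a stride of N elements */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* With GPU datatype support, the non-contiguous column is packed on the
     * device by a CUDA kernel before being sent over the network. */
    MPI_Send(d_grid, 1, column, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);

    MPI_Type_free(&column);
    cudaFree(d_grid);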

SLIDE 22: Application-Level Evaluation (LBMGPU-3D)

  • LBM-CUDA (courtesy: Carlos Rosales, TACC)
  • Lattice Boltzmann Method for multiphase flows with large density ratios
  • 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
  • Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node

(Figure: total execution time of 3D LBM-CUDA on 8, 16, 32 and 64 GPUs for MPI vs. MPI-GPU; the GPU-aware version reduces execution time by 5.6%, 8.2%, 13.5% and 15.5%, respectively)

SLIDE 23: Outline

  • Overview of MVAPICH2-GPU Project
  • GPUDirect RDMA with Mellanox IB adapters
  • Other Optimizations for GPU Communication
  • Support for MPI + OpenACC
  • CUDA and OpenACC extensions in OMB

SLIDE 24: OpenACC

  • OpenACC is gaining popularity; several sessions during GTC
  • A set of compiler directives (#pragma) to offload specific loops or parallelizable sections of code onto accelerators

      #pragma acc kernels
      {
          for (i = 0; i < size; i++) {
              A[i] = B[i] + C[i];
          }
      }

  • Routines to allocate/free memory on accelerators

      buffer = acc_malloc(MYBUFSIZE);
      acc_free(buffer);

  • Supported for C, C++ and Fortran
  • Large set of clauses: copy, copyout, private, independent, etc. (a self-contained example follows below)
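
Putting the fragments above together, a minimal self-contained OpenACC program might look like the sketch below; the array names, the vector size and the printed check are illustrative assumptions.

    /* Sketch: a complete OpenACC vector add using directives only.
     * Array names and size are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int size = 1 << 20;
        float *A = malloc(size * sizeof(float));
        float *B = malloc(size * sizeof(float));
        float *C = malloc(size * sizeof(float));
        for (int i = 0; i < size; i++) { B[i] = 1.0f; C[i] = 2.0f; }

        /* offload the loop; the compiler manages device copies of A, B and C */
        #pragma acc parallel loop copyin(B[0:size], C[0:size]) copyout(A[0:size])
        for (int i = 0; i < size; i++) {
            A[i] = B[i] + C[i];
        }

        printf("A[0] = %f\n", A[0]);
        free(A); free(B); free(C);
        return 0;
    }

The next two slides show how buffers managed this way can be handed to MVAPICH2 directly.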

SLIDE 25: Using MVAPICH2 with OpenACC 1.0

  • acc_malloc to allocate device memory
    – No changes to MPI calls
    – MVAPICH2 detects the device pointer and optimizes data movement
    – Delivers the same performance as with CUDA

      A = acc_malloc(sizeof(int) * N);
      ...
      #pragma acc parallel loop deviceptr(A)
      ... // compute for loop
      MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
      ...
      acc_free(A);

SLIDE 26: Using MVAPICH2 with OpenACC 2.0

  • acc_deviceptr to get the device pointer (in OpenACC 2.0)
    – Enables MPI communication from memory allocated by the compiler, once it is available in OpenACC 2.0 implementations
    – MVAPICH2 will detect the device pointer and optimize communication
    – Delivers the same performance as with CUDA

      A = malloc(sizeof(int) * N);
      ...
      #pragma acc data copyin(A)
      {
          #pragma acc parallel loop
          ... // compute for loop
          MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
      }
      ...
      free(A);

SLIDE 27: Outline

  • Overview of MVAPICH2-GPU Project
  • GPUDirect RDMA with Mellanox IB adapters
  • Other Optimizations for GPU Communication
  • Support for MPI + OpenACC
  • CUDA and OpenACC extensions in OMB

SLIDE 28: CUDA and OpenACC Extensions in OMB

  • OSU Micro-Benchmarks (OMB) are widely used to compare the performance of different MPI stacks and networks
  • Enhancements to measure the performance of MPI communication from GPU memory
    – Point-to-point: latency, bandwidth and bi-directional bandwidth
    – Collectives: all collectives are supported
  • Support for CUDA and OpenACC
  • Flexible selection of data movement between CPU (H) and GPU (D): D->D, D->H and H->D
  • Available from http://mvapich.cse.ohio-state.edu/benchmarks
  • Also distributed in an integrated manner with the MVAPICH2 stack
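
As a usage note (hedged; check the OMB README of the installed version for the exact syntax), the GPU extensions are typically exercised by passing buffer-location arguments to a CUDA- or OpenACC-enabled build of the benchmarks, for example:

    mpirun -np 2 ./osu_latency D D    # GPU-to-GPU (device buffers on both sides)
    mpirun -np 2 ./osu_latency D H    # device buffer at sender, host buffer at receiver

Here H denotes host memory and D denotes device memory, matching the D->D, D->H and H->D selections listed above.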

SLIDE 29: Summary

  • MVAPICH2 is evolving to efficiently support MPI communication on heterogeneous clusters with NVIDIA GPUs
  • Simplifies the task of porting MPI applications to these new architectures
  • Optimizes data movement while hiding system complexity from the user
  • Users still have to be aware of system configurations and the knobs MVAPICH2 has to offer
  • User feedback is critical as the implementations mature
SLIDE 30: Web Pointers

    NOWLAB Web Page: http://nowlab.cse.ohio-state.edu
    MVAPICH Web Page: http://mvapich.cse.ohio-state.edu