GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda - PowerPoint PPT Presentation


SLIDE 1

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

GPU Technology Conference (GTC) 2016

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

SLIDE 2

GTC 2016 2 Network Based Computing Laboratory

Streaming Applications

• Examples: surveillance, habitat monitoring, etc.
• Require efficient transport of data from/to distributed sources/sinks
• Sensitive to latency and throughput metrics
• Require HPC resources to efficiently carry out compute-intensive tasks

SLIDE 3

Nature of Streaming Applications

• Pipelined data-parallel compute phases that form the crux of streaming applications lend themselves to GPGPUs
• Data distribution to GPGPU sites occurs over PCIe within the node and over InfiniBand interconnects across nodes
• The broadcast operation is a key dictator of the throughput of streaming applications
  – Reduced latency for each operation
  – Support for multiple back-to-back operations

Courtesy: Agarwalla, Bikash, et al. "Streamline: A scheduling heuristic for streaming applications on the grid." Electronic Imaging 2006.

SLIDE 4

Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/coprocessors are becoming common in high-end systems
• Pushing the envelope for Exascale computing

Accelerators/coprocessors offer high compute density and high performance/watt (>1 Tflop/s double precision on a chip); high-performance interconnects (InfiniBand) offer <1 µs latency and >100 Gbps bandwidth.
Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A

SLIDE 5

Large-scale InfiniBand Installations

• 235 IB clusters (47%) in the Nov 2015 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
  – 462,462 cores (Stampede) at TACC (10th)
  – 185,344 cores (Pleiades) at NASA/Ames (13th)
  – 72,800 cores Cray CS-Storm in US (15th)
  – 72,800 cores Cray CS-Storm in US (16th)
  – 265,440 cores SGI ICE at Tulip Trading Australia (17th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
  – 72,000 cores (HPC2) in Italy (19th)
  – 152,692 cores (Thunder) at AFRL/USA (21st)
  – 147,456 cores (SuperMUC) in Germany (22nd)
  – 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  – 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  – 194,616 cores (Cascade) at PNNL (27th)
  – 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  – 110,400 cores (Pangea) in France (33rd)
  – 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  – 57,600 cores (SwiftLucy) in US (37th)
  – 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  – 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  – 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  – and many more!

SLIDE 6

InfiniBand Networking Technology

• Introduced in Oct 2000
• High-performance point-to-point data transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 µs), high bandwidth (up to 12.5 GB/s, i.e., 100 Gbps), and low CPU utilization (5-10%)
• Multiple features
  – Offloaded send/recv
  – RDMA read/write
  – Atomic operations
  – Hardware multicast support through Unreliable Datagram (UD)
• A message sent from a single source (host memory) can reach all destinations (host memory) in a single pass over the network through switch-based replication
  – Restricted to one MTU; large messages need to be sent in a chunked manner
  – Unreliable; reliability needs to be addressed
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, …
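Because hardware multicast is limited to one MTU, a large broadcast payload has to be split into MTU-sized chunks, each sent as a separate multicast packet. The chunking arithmetic can be sketched as follows (function names are illustrative, not MVAPICH2 code):

```c
#include <stddef.h>

/* Number of MTU-sized multicast packets needed for a payload.
 * Illustrative helper, not actual MVAPICH2 code. */
static size_t mcast_num_chunks(size_t payload, size_t mtu)
{
    return (payload + mtu - 1) / mtu;   /* ceiling division */
}

/* Length of the i-th chunk (the last chunk may be shorter). */
static size_t mcast_chunk_len(size_t payload, size_t mtu, size_t i)
{
    size_t off = i * mtu;               /* byte offset of chunk i */
    return (payload - off < mtu) ? payload - off : mtu;
}
```

Each chunk then maps to one UD multicast send; the receiver reassembles by offset, which is one reason reliability handling (next bullet) is needed on top.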

SLIDE 7

InfiniBand Hardware Multicast Example

SLIDE 8

Multicast-aware CPU-Based MPI_Bcast on Stampede using MVAPICH2 (6K nodes with 102K cores)

[Charts: MPI_Bcast latency (µs) vs. message size for small and large messages at 102,400 cores, and latency vs. number of nodes for 16-byte and 32 KB messages, comparing the Default and Multicast designs]
Setup: ConnectX-3 FDR (54 Gbps); 2.7 GHz dual octa-core (Sandy Bridge) Intel; PCIe Gen3; Mellanox IB FDR switch

SLIDE 9

GPUDirect RDMA (GDR) and CUDA-Aware MPI

• Before CUDA 4: additional copies
  – Low performance and low productivity
• After CUDA 4: host-based pipeline
  – Unified Virtual Addressing
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect RDMA support
  – GPU-to-GPU direct transfer
  – Bypasses host memory
  – Hybrid design to avoid PCIe bottlenecks

[Diagram: GPU, CPU, chipset, GPU memory, and InfiniBand HCA data paths]
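The host-based pipeline gets its benefit from overlap: while one chunk travels over IB, the next chunk is being copied across PCIe. A toy cost model of that overlap (stage times and function names are abstract inputs for illustration, not MVAPICH2 internals or measured values):

```c
/* Toy model of a host-staged transfer: without overlap every chunk pays
 * both the PCIe copy and the IB send; with pipelining the total is
 * roughly one pipeline fill plus the slower stage per chunk. */
static double staged_time(double t_copy, double t_send, int chunks)
{
    return chunks * (t_copy + t_send);        /* no overlap */
}

static double pipelined_time(double t_copy, double t_send, int chunks)
{
    double slow = t_copy > t_send ? t_copy : t_send;
    double fast = t_copy + t_send - slow;
    return fast + chunks * slow;              /* fill once, slow stage dominates */
}
```

GDR removes the copy stage altogether, which is why it beats even a well-tuned pipeline for small messages.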

SLIDE 10

Setup: MVAPICH2-GDR 2.2b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPUDirect RDMA

Performance of MVAPICH2-GPU with GPUDirect RDMA (GDR)

[Charts: GPU-GPU internode latency, bandwidth, and bi-bandwidth vs. message size, comparing MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR. GDR reaches 2.18 µs small-message latency and up to 11X higher bandwidth.]

More details in the 2:30pm session today

SLIDE 11

Broadcasting Data from One GPU Memory to Other GPU Memories: Shortcomings

• The traditional short-message broadcast operation between GPU buffers involves a Host-Staged Multicast (HSM)
  – Data copied from GPU buffers to host memory
  – Uses InfiniBand Unreliable Datagram (UD)-based hardware multicast
• Sub-optimal use of the near-scale-invariant UD-multicast performance
• PCIe resources wasted and benefits of multicast nullified
• GPUDirect RDMA capabilities unused

SLIDE 12

Problem Statement

• Can we design a new GPU broadcast scheme that delivers low latency for streaming applications?
• Can we combine the GDR and IB MCAST features to
  – achieve the best performance, and
  – free the host-device PCIe bandwidth for application needs?
• Can such a design be extended to support heterogeneous configurations?
  – Host-to-Device (e.g., a camera connected to the host with devices used for computation)
  – Device-to-Device
  – Device-to-Host
• How to support such a design on systems with multiple GPUs per node?
• How much performance benefit can be achieved with the new designs?

SLIDE 13

Existing Protocol for GPU Multicast

• Copy user GPU data to host buffers (cudaMemcpy)
• Perform the multicast, then copy back
• cudaMemcpy dictates performance
• Requires PCIe host-device resources

[Diagram: user data staged from the GPU through a host vbuf to the HCA and network]

SLIDE 14

Alternative Approaches

• Can we substitute the cudaMemcpy with a better design?
• cudaMemcpy: default scheme
  – Big overhead for small messages
• Loopback-based design: uses the GDR feature
  – The process establishes a self-connection
  – Copy H-D ⇒ RDMA write (H, D); copy D-H ⇒ RDMA write (D, H)
  – P2P bottleneck ⇒ good for small and medium sizes
• GDRCOPY-based design: new module for fast copies
  – Involves GPU PCIe BAR1 mapping
  – The CPU performs the copy ⇒ blocks until completion
  – Very good performance for H-D for small and medium sizes
  – Very good performance for D-H only for very small sizes

[Diagram: host buffer and GPU buffer connected to the HCA over PCIe]
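A hybrid design can pick the copy mechanism per message based on direction and size. A hypothetical selector capturing the trade-offs above (the thresholds are invented for illustration; MVAPICH2-GDR's actual tuned crossover points differ):

```c
#include <stddef.h>

/* Hypothetical per-message copy-scheme selection mirroring the observed
 * trade-offs: GDRCOPY excels H-D up to medium sizes but D-H only when
 * tiny; loopback covers medium D-H; cudaMemcpy wins for large messages. */
enum copy_dir { H2D, D2H };

static const char *pick_scheme(enum copy_dir dir, size_t bytes)
{
    if (dir == H2D)
        return bytes <= 32 * 1024 ? "gdrcopy" : "cudamemcpy";
    if (bytes <= 256)                 /* D-H: GDRCOPY wins only when tiny */
        return "gdrcopy";
    return bytes <= 32 * 1024 ? "loopback" : "cudamemcpy";
}
```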
SLIDE 15

GDRCOPY-based design

• Copy user GPU data to host buffers with GDRCOPY
• Perform the multicast and copy back
• The D-H operation limits performance
• Can we avoid GDRCOPY for D-H copies?

[Diagram: GDRCOPY staging between the GPU user buffer and host vbuf, followed by MCAST]

SLIDE 16

(GDRCOPY + Loopback)-based design

• Copy user GPU data to host buffers using the loopback scheme
• Perform the multicast
• Copy the data back to the GPU using the GDRCOPY scheme
• Good performance for both H-D and D-H copies
• Expected performance only for small messages
• Still uses the PCIe host-device resources

[Diagram: loopback-based D-H copy, MCAST, then GDRCOPY-based H-D copy]

SLIDE 17

Experimental Setup and Details of Benchmarks

• Experiments were run on Wilkes @ University of Cambridge
  – 12-core Ivy Bridge Intel Xeon E5-2630 @ 2.60 GHz with 64 GB RAM
  – FDR ConnectX2 HCAs + NVIDIA K20c GPUs
  – Mellanox OFED version MLNX_OFED_LINUX-2.1-1.0.6, which provides the required GPUDirect RDMA (GDR) support
  – Only one GPU and one HCA per node (same socket) used
• Based on the latest MVAPICH2-GDR 2.1 release (http://mvapich.cse.ohio-state.edu/downloads)
• Uses the OSU Micro-Benchmarks suite
  – osu_bcast benchmark, modified to mimic back-to-back broadcasts

SLIDE 18

Performance of Naïve (cudaMemcpy-based) scheme

• Big overhead for small messages due to the overhead of the copies (about 20 µs for 1 byte); reasonably good for large messages
• Good scalability with small messages, as it is truly MCAST-based
SLIDE 19

Performance of GDRCOPY-based scheme

• Achieves 3 µs for small-message broadcast (very good performance)
• The GDRCOPY D-H operation has a big overhead for large messages (limited D-H performance)
SLIDE 20

Performance of LoopBack-based scheme

• Achieves less than 6 µs for small-message broadcast (acceptable/good performance)
• Uses the IB loopback path for both D-H and H-D copies
• Sender and receiver might share the same network bandwidth
SLIDE 21

Performance of Hybrid (GDRCOPY + Loopback + Naïve) scheme

• Takes advantage of the best of each scheme: loopback for D-H and GDRCOPY for H-D, switching to the loopback design as the message size grows
• Good scalability up to a 64-GPU system
SLIDE 22

Comparing Different Schemes

• Up to 3X performance improvement
• Good scalability
• However, all these schemes still use host-based staging ⇒ they consume PCIe host-device resources

SLIDE 23

Can we do better?

• Can we have enhanced designs that:
  – deliver good performance (low latency for high-throughput broadcast operations),
  – free PCIe host-device resources, and
  – provide good support for all message sizes (small and large)?

SLIDE 24

Challenges in Combining GDR and MCAST Features

• How to handle control messages and data that belong to two different memories (control on the host, data on the GPU)?
• How to efficiently handle multi-GPU configurations?
• How to handle reliability, given that MCAST is a UD-based transport? ⇒ Can we provide MPI_Bcast semantic support?
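Since UD multicast can drop or reorder packets, MPI_Bcast semantics require a reliability layer on top, for example sequence numbers with receiver-side gap detection. A minimal sketch of such gap detection (a generic NACK-style approach, not the scheme MVAPICH2 implements):

```c
/* Receiver-side gap detection for a sequence-numbered UD multicast stream.
 * Returns how many packets (expected .. seq-1) were lost and must be
 * NACKed/retransmitted when packet `seq` arrives, and advances the
 * expected counter. Generic illustration, not MVAPICH2 code. */
static unsigned on_mcast_packet(unsigned *expected, unsigned seq)
{
    if (seq < *expected)          /* duplicate or late retransmission */
        return 0;
    unsigned missing = seq - *expected;
    *expected = seq + 1;
    return missing;
}
```

A real design also needs sender-side retention of unacknowledged chunks so that NACKed packets can be resent, typically over a reliable connection.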

SLIDE 25

Combining GDR and MCAST Features: Scatter-Gather List (SGL) Approach

• MCAST two separate addresses (control on the host + data on the GPU) in one IB message
• Direct IB read/write from/to the GPU using the GDR feature for low latency ⇒ zero-copy-based schemes
• MCAST feature provides scalability ⇒ switch-based message duplication
• No extra copy between host and GPU ⇒ frees up PCIe resources for application needs

A. Venkatesh, H. Subramoni, K. Hamidouche and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," IEEE International Conference on High Performance Computing (HiPC 2014).

SLIDE 26

Overview of the envisioned SGL-based approach

• One-time registration of a window of persistent buffers in streaming apps
• Gather control and user data at the source and scatter them at the destinations using the Scatter-Gather-List abstraction
• The scheme lends itself to the pipelined phases abundant in streaming applications and avoids stressing PCIe

[Diagram: control (host) and user data (GPU) gathered into a single MCAST message at the source and scattered at each destination]
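In verbs terms, the source posts one multicast send whose scatter-gather list has two entries, a host control buffer and a GPU data buffer, and each receiver posts a matching two-entry receive. The wire-level effect can be illustrated with a plain-memory simulation (buffer names and sizes are invented; real code would use `ibv_sge` arrays with GDR-registered GPU memory):

```c
#include <stddef.h>
#include <string.h>

/* Simulated scatter-gather entry: one region carried by an IB message. */
struct sge { const void *addr; size_t len; };

/* Gather the SGL entries into one contiguous wire message.
 * Returns the number of bytes packed. */
static size_t sgl_gather(char *wire, const struct sge *sgl, int n)
{
    size_t off = 0;
    for (int i = 0; i < n; i++) {
        memcpy(wire + off, sgl[i].addr, sgl[i].len);
        off += sgl[i].len;
    }
    return off;
}

/* Scatter the wire message back into separate destination buffers. */
static void sgl_scatter(const char *wire, void **dst,
                        const size_t *len, int n)
{
    size_t off = 0;
    for (int i = 0; i < n; i++) {
        memcpy(dst[i], wire + off, len[i]);
        off += len[i];
    }
}
```

On real hardware the gather and scatter are performed by the HCA directly from the posted SGL, so neither side stages the GPU data through host memory.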

SLIDE 27

SGL-based Design Evaluation

• The SGL-based design is able to deliver:
  – low latency and high scalability (less than 4 µs)
  – free PCIe resources for applications
SLIDE 28

Benefits of SGL-based design with Streaming Benchmark

• HSM (Host-Staged Multicast) vs. GSM (GPU-Staged Multicast) = SGL
• Based on a synthetic benchmark that mimics broadcast patterns in streaming applications
  – Long window of persistent m-byte buffers with 1,000 back-to-back multicast operations issued
• Execution time reduces by 3x-4x
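The benchmark structure described above can be sketched as a loop cycling over the persistent buffer window (function and constant names are invented; the real benchmark issues MPI-level broadcast operations):

```c
/* Sketch of the synthetic streaming benchmark: NITER back-to-back
 * broadcast operations cycle over a window of persistent (one-time
 * registered) buffers. `bcast` stands in for the real multicast call. */
#define WINDOW 64
#define NITER  1000

static int run_streaming_bench(void (*bcast)(int buf_idx))
{
    int issued = 0;
    for (int i = 0; i < NITER; i++) {
        bcast(i % WINDOW);   /* persistent buffer i mod WINDOW is reused */
        issued++;
    }
    return issued;
}
```

Cycling a fixed window keeps registration cost out of the measured loop and matches the one-time-registration assumption of the SGL design.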

SLIDE 29

Limits of SGL-based Broadcast Designs

• Limits the support to Device-to-Device broadcast only
  – Requires a copy from the host to the device at the source
  – Big overhead of the copy
  – Breaks the pipelined view of the streaming application
• Not scalable for multi-GPU nodes
  – Flat design with a one-to-one MCAST connection for each GPU

[Diagram: the host performs an explicit H-D copy, then broadcasts over the IB network to every GPU individually]

SLIDE 30

On-going Work

• Can MCAST+GDR be combined for heterogeneous configurations?
  – Source on the host and destinations on devices
  – Heterogeneity: control + data are contiguous on one side and non-contiguous on the other
  – Combine MCAST and GDR ⇒ no use of PCIe resources (free for application usage)
• How about multi-GPU nodes? Can intra-node topology awareness help?
  – Hierarchical and complex PCIe interconnects
  – How to maximize the resource utilization of both PCIe and IB interconnects?
• Looking forward: the solution should benefit current-generation systems and deliver maximal benefits on next-generation systems

SLIDE 31

Conclusions

• The IB MCAST feature provides high scalability and low latency
• The GDR feature provides direct access between IB and GPUs
• MVAPICH2-GDR provides several schemes to efficiently broadcast from/to GPU memories using host-staged techniques:
  – Naïve design + host-based MCAST
  – GDRCOPY + host-based MCAST
  – GDRCOPY + Loopback + host-based MCAST
• Presented a set of designs to couple the GDR and IB MCAST features
  – Results are promising
  – Designs need to be extended for heterogeneity and multi-GPU support
• The new designs will be available in a future MVAPICH2-GDR release

SLIDE 32

Two Additional Talks

• S6411 - MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies
  – Day: Wednesday, 04/06; Time: 14:30-14:55; Location: Room 211A
• S6418 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions
  – Day: Wednesday, 04/06; Time: 16:30-16:55; Location: Room 211A

SLIDE 33

Thank You!

panda@cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/