High-Performance GPU Clustering: GPUDirect RDMA over 40GbE iWARP
Tom Reu
Consulting Applications Engineer
Chelsio Communications
tomreu@chelsio.com
Chelsio Corporate Snapshot
Leader in High Speed Converged Ethernet Adapters
Market Coverage: Manufacturing, Oil and Gas, Finance, Service/Cloud, Storage, Media, HPC, Security
OEM Snapshot
RDMA Overview
Performance and efficiency in return for a new communication paradigm
[Diagram: two hosts, each with a Chelsio T5 RNIC, transferring data directly between application memories over Ethernet]
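One quick way to exercise this paradigm on iWARP hardware is librdmacm's rping utility, which runs an RDMA ping-pong between two RNICs; a minimal sketch, with an illustrative address:

rping -s -a 192.168.1.1 -v    # server: listen on the RNIC's IP address
rping -c -a 192.168.1.1 -v    # client: connect and exchange RDMA operations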
iWARP
What is it?
▪ The IETF standard for RDMA over TCP/IP Ethernet
▪ Specification maintained by the IETF STORM (STORage Maintenance) working group
▪ Runs over standard TCP transports, so no special fabric is needed
▪ Protocol Extensions (RFC 7306) add capabilities such as atomic operations and immediate data
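To confirm an iWARP RNIC is visible to the Linux RDMA stack, the standard libibverbs utilities can be used; the device name cxgb4_0 matches the interface referenced later in this deck:

ibv_devices              # list RDMA devices; a Chelsio T5 appears as cxgb4_0
ibv_devinfo -d cxgb4_0   # "transport: iWARP (1)" confirms the transport type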
iWARP
Increasing interest in iWARP of late
iWARP
Advantages over Other RDMA Transports
iWARP vs RoCE

iWARP: Native TCP/IP over Ethernet, no different from NFS
RoCE:  Difficult to install and configure ("needs a team of experts"): Plug-and-Debug

iWARP: Works with ANY Ethernet switches
RoCE:  Requires DCB, an expensive equipment upgrade

iWARP: Works with ALL Ethernet equipment
RoCE:  Poor interoperability; may not work with switches from different vendors

iWARP: No need for special QoS or configuration; TRUE Plug-and-Play
RoCE:  Fixed QoS configuration; DCB must be set up identically across all switches

iWARP: No need for special configuration, preserves network robustness
RoCE:  Easy to break; switch configuration can cause performance collapse

iWARP: TCP/IP allows reach to Cloud scale
RoCE:  Does not scale; requires PFC and is limited to a single subnet

iWARP: No distance limitations; ideal for remote communication and HA
RoCE:  Short distance; PFC range is limited to a few hundred meters at most

iWARP: WAN routable, uses any IP infrastructure
RoCE:  RoCEv1 is not routable; RoCEv2 requires lossless IP infrastructure and restricts router configuration

iWARP: The standard for the whole stack has been stable for a decade
RoCE:  RoCEv2 is incompatible with v1; more fixes to the missing reliability and scalability layers are required and expected

iWARP: Transparent and open IETF standards process
RoCE:  Incomplete specification and opaque process
Chelsio's T5
Single ASIC does it all
Protects today's investment in iSCSI
T5 ASIC Architecture
▪ Single processor data-flow pipelined architecture
▪ Up to 1M connections
▪ Concurrent multi-protocol operation
[Block diagram: 1G/10G/40G and 100M/1G/10G MACs; embedded Layer 2 Ethernet switch; lookup, filtering and firewall; cut-through RX and TX memory; data-flow protocol engine; traffic manager; application co-processors (TX and RX); DMA engine on PCIe x8 Gen 3; general purpose processor; on-chip DRAM with memory controller; optional external DDR3 memory]
Single connection at 40Gb, low latency, high performance: a purpose-built protocol processor
Leading Unified Wire™ Architecture
Converged Network Architecture with all-in-one Adapter and Software

Networking
▪ 4x10GbE/2x40GbE NIC
▪ Full Protocol Offload
▪ Data Center Bridging
▪ Hardware firewall
▪ Wire Analytics
▪ DPDK/netmap

HFT
▪ WireDirect technology
▪ Ultra low latency
▪ Highest messages/sec
▪ Wire rate classification

Storage
▪ NVMe/Fabrics
▪ SMB Direct
▪ iSCSI and FCoE with T10-DIX
▪ iSER and NFS over RDMA
▪ pNFS (NFS 4.1) and Lustre
▪ NAS Offload
▪ Diskless boot
▪ Replication and failover

Virtualization & Cloud
▪ Hypervisor offload
▪ SR-IOV with embedded VEB
▪ VEPA, VN-TAGs
▪ VXLAN/NVGRE
▪ NFV and SDN
▪ OpenStack storage
▪ Hadoop RDMA

HPC
▪ iWARP RDMA over Ethernet
▪ GPUDirect RDMA
▪ Lustre RDMA
▪ pNFS (NFS 4.1)
▪ OpenMPI
▪ MVAPICH

Media Streaming
▪ Traffic Management
▪ Video segmentation Offload
▪ Large stream capacity

Single Qualification, Single SKU
Concurrent Multi-Protocol Operation
GPUDirect RDMA
▪ Supported today on Tesla and Quadro GPUs
▪ Enables network adapters and other devices to read and write CUDA host and device memory directly
▪ Eliminates unneeded host-memory copies and reduces CPU overhead by copying data directly to and from pinned GPU memory
▪ Requires the GPU and the peer device to share the same upstream PCIe root complex
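Because of the root-complex requirement, it is worth verifying the PCIe topology before benchmarking; a sketch using standard tools:

nvidia-smi topo -m   # GPU/NIC connectivity matrix; PIX/PXB paths stay on one root complex, SYS crosses the CPU interconnect
lspci -tv            # raw PCIe tree, to confirm the T5 adapter and the GPU hang off the same root port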
GPUDirect RDMA
▪ GPU memory is read and written directly from the network adapter, bypassing host memory for data communication
▪ T5 iWARP RDMA over Ethernet is certified with NVIDIA GPUDirect
[Diagram: two hosts connected over a LAN/datacenter/WAN network; on each host, payload moves directly between the GPU and the RNIC as packets, while notifications go through the CPU and host memory]
Modules required for GPUDirect RDMA with iWARP
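A minimal sketch of loading that stack on Linux, assuming the Chelsio Unified Wire drivers and NVIDIA's out-of-tree nv_peer_mem module are installed (the exact list varies with driver versions):

modprobe cxgb4        # Chelsio T5 network driver
modprobe iw_cxgb4     # Chelsio iWARP RDMA driver
modprobe ib_uverbs    # RDMA userspace verbs support
modprobe rdma_ucm     # RDMA userspace connection manager
modprobe nvidia       # NVIDIA GPU driver
modprobe nv_peer_mem  # peer-memory module that lets the RNIC target GPU memory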
HOOMD-blue
Highly Optimized Object-oriented Many-particle Dynamics, Blue Edition
Uses GPUDirect RDMA with NO CHANGES TO THE CODE - AT ALL!
HOOMD-blue
Test Configuration

$MPI_HOME/bin/mpirun --allow-run-as-root \
    -mca btl_openib_want_cuda_gdr 1 \
    -np X -hostfile /root/hosts \
    -mca btl openib,sm,self \
    -mca btl_openib_if_include cxgb4_0:1 \
    --mca btl_openib_cuda_rdma_limit 65538 \
    /root/hoomd-install/bin/hoomd ./bmark.py --mode=gpu|cpu
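Before running, one can sanity-check that the Open MPI build has GPUDirect support compiled into the openib BTL; this informational MCA flag is standard in CUDA-aware Open MPI builds of this era:

ompi_info --all | grep btl_openib_have_cuda_gdr   # "true" means the BTL can use GPUDirect RDMA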
HOOMD-blue
Lennard-Jones Liquid 64K Particles Benchmark
Representative of the performance HOOMD-blue achieves for straight pair potential simulations
Efficient Performance™ 19
HOOMD-blue
Lennard-Jones Liquid 64K Particles Benchmark Results
Average Timesteps per Second Test 1 Test 2 Test 3 450 900 1350 1800 1,771 1,403 1,230 1,089 503 488 214 88 26 CPU GPU w/o GPUDirect RDMA GPU w/ GPUDirect RDMA
Longer is Better
2 CPU Cores 2 GPUs 2 GPUs 8 CPU Cores 4 GPUs 4 GPUs 40 CPU Cores 8 GPUs 8 GPUs
HOOMD-blue
Lennard-Jones Liquid 64K Particles Benchmark Results

Hours to complete 10 million (10e6) steps (shorter is better)
Test 1: 2 CPU cores / 2 GPUs; Test 2: 8 CPU cores / 4 GPUs; Test 3: 40 CPU cores / 8 GPUs

                         Test 1   Test 2   Test 3
CPU                      108      32       13
GPU w/o GPUDirect RDMA   6        5.5      2.5
GPU w/ GPUDirect RDMA    2.2      1.7      1.5
HOOMD-blue
Quasicrystal Benchmark
Uses an oscillatory pair potential that forms an icosahedral quasicrystal
Reference: Engel et al., "Computational self-assembly of a one-component icosahedral quasicrystal", Nature Materials 14 (January 2015), pp. 109-116
HOOMD-blue
Quasicrystal Benchmark Results

Average timesteps per second (longer is better)
Test 1: 2 CPU cores / 2 GPUs; Test 2: 8 CPU cores / 4 GPUs; Test 3: 40 CPU cores / 8 GPUs

                         Test 1   Test 2   Test 3
CPU                      11       43       31
GPU w/o GPUDirect RDMA   308      656      915
GPU w/ GPUDirect RDMA    407      728      1,158
HOOMD-blue
Quasicrystal Benchmark Results

Hours to complete 10 million (10e6) steps (shorter is better)
Test 1: 2 CPU cores / 2 GPUs; Test 2: 8 CPU cores / 4 GPUs; Test 3: 40 CPU cores / 8 GPUs

                         Test 1   Test 2   Test 3
CPU                      264      63       86
GPU w/o GPUDirect RDMA   9        4        3
GPU w/ GPUDirect RDMA    7        3.5      2.4
Caffe
Deep Learning Framework
Developed by the Berkeley Vision and Learning Center (BVLC)
Summary
GPUDirect RDMA over 40GbE iWARP
▪ iWARP delivers RDMA over a standard TCP/IP Ethernet network
▪ A proven, routable transport mechanism
▪ No special fabric infrastructure and the management requirements that come along with it
▪ HOOMD-blue uses GPUDirect RDMA with NO CHANGES required to the code
▪ GPUDirect RDMA delivers clear performance improvements
▪ 40GbE today, with higher Ethernet speeds on the horizon
More information
GPUDirect RDMA over 40GbE iWARP
Papers, Benchmarks, etc.:
▪ wp-content/uploads/resources/T5-40Gb-Linux-GPUDirect.pdf
▪ 13671/189427