Design Guidelines
for
High Performance RDMA Systems
Anuj Kalia (CMU) Michael Kaminsky (Intel Labs) David Andersen (CMU)
RDMA is cheap (and fast!)
Mellanox Connect-IB: 2x 56 Gbps InfiniBand ports, ~2 µs RTT, ~$1300 per NIC.
Problem: performance depends on complex low-level factors.

[Figure: an RDMA read. The client's RDMA read request travels to the server's NIC, which issues a PCIe DMA read against server memory and returns the RDMA read response; the server's CPU cores and L3 cache are bypassed entirely.]
Two ways to use RDMA:
- Remote CPU bypass (one-sided)
- Remote CPU involved (messaging, two-sided)

Which performs better? (Preview: the obvious one-sided design reaches only 2.2 M/s.)
The RDMA design space:

Operations: READ, WRITE, ATOMIC, SEND/RECV
Optimizations: inlining, unsignaled completions, doorbell batching, 0B-RECVs, WQE shrinking
Transports: Reliable or Unreliable, Connected or Datagram; remote CPU bypass (one-sided) vs. two-sided
Two guidelines:

1. NICs have multiple processing units (PUs): avoid contention, exploit parallelism.
2. PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs) and NIC-to-CPU messages (DMAs).
Sequencer design 1: one-sided atomics. Clients issue Fetch&Add(A, 1) on a sequence counter A in server memory; the NIC's PUs perform a PCIe DMA read and DMA write for each operation, and operations on the same address serialize.

Latency: ~500 ns (one PCIe round trip). Throughput: ~2 M/s.
Sequencer design 2: RPC [HERD, SIGCOMM 14]. The counter A lives in the server CPU's L3 cache (core-to-L3 access: ~20 ns). A client sends its RPC request as an RDMA write; the server increments A and replies with a SEND. The PCIe link (~500 ns) carries only the request and response, not the counter update.
[Chart: sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7. The final design will be ~50x faster than atomics.]
Optimization: doorbell batching. In push mode, the CPU writes each SEND's WQE to the NIC with an MMIO, burning many CPU cycles. In pull mode (doorbell batching), the CPU writes the WQEs to host memory and rings the NIC's doorbell with a single MMIO; the NIC then pulls the whole batch of requests with DMA. Responses are unaffected.
[Chart: sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7; + doorbell batching: 16.6.]
With a single queue, one NIC PU handles every SEND (RPC response) while the other PUs sit idle: the response path becomes the bottleneck even though the CPU-side counter work is cheap.
[Chart: sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7; + doorbell batching: 16.6; + 3 queues: 27.4.]

[Chart: sequencer throughput (M/s), continued. + 6 cores (with batching and multiple queues): 97.2. Bottleneck: PCIe DMA bandwidth (see paper).]
Optimization: WQE shrinking (header-only SENDs). A SEND WQE is 128 B: a 64 B header (which includes a 4 B immediate field) plus a 64 B data segment, of which the sequencer uses only 8 B (52 B unused). Moving the small payload into the header's immediate field shrinks the WQE to 64 B, halving the bytes the NIC must DMA per response.
[Chart: sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7; + 6 cores with multiple queues and doorbell batching: 97.2; + header-only SENDs: 122 — roughly 50x the atomics baseline.]
[Chart: HERD [SIGCOMM 14] key-value store throughput (M/s) vs. number of cores (2-14). Workload: 16 B keys, 32 B values, 5% PUTs. Doorbell batching improves on the baseline; annotated batch sizes on the curve: 14 and 9 responses per doorbell.]
Summary:
1. NICs have multiple processing units (PUs): avoid contention, exploit parallelism.
2. PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs) and NIC-to-CPU messages (DMAs).

Code: https://github.com/anujkaliaiitd/rdma_bench