Design Guidelines for High Performance RDMA Systems - Anuj Kalia - PowerPoint PPT Presentation



SLIDE 1

Design Guidelines for High Performance RDMA Systems

Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU)

SLIDE 2

RDMA is cheap (and fast!)

Mellanox Connect-IB

  • 2x 56 Gbps InfiniBand
  • ~2 µs RTT
  • RDMA
  • $1300

Problem: performance depends on complex low-level factors

SLIDE 3

Background: RDMA read

[Diagram: an RDMA read request arrives at the server NIC, which issues a DMA read over PCI Express to fetch the data from the server's memory/L3 and returns it in the RDMA read response; the server's CPU cores are not involved.]
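To map the diagram to code, here is a minimal libibverbs sketch of posting a one-sided RDMA read. The queue pair, completion queue, registered local buffer, and the remote address/rkey (normally exchanged out of band) are assumed to exist; all names are illustrative, not from the slides.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Issue a one-sided RDMA read: the remote CPU is not involved; the remote
 * NIC DMAs the data from the server's memory over PCIe and ships it back.
 * qp, cq, buf/mr, and remote_addr/rkey are assumed to be set up elsewhere. */
static int rdma_read(struct ibv_qp *qp, struct ibv_cq *cq,
                     void *buf, uint32_t len, struct ibv_mr *mr,
                     uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.opcode = IBV_WR_RDMA_READ;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;

    if (ibv_post_send(qp, &wr, &bad)) return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin until completion */ }
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```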

SLIDE 4

How to design a sequencer?

[Diagram: clients ask a central server for the next sequence number and receive consecutive values, e.g. 87 and 88.]

SLIDE 5

Which RDMA ops to use?


Remote CPU bypass (one-sided)

  • Read
  • Write
  • Fetch-and-add
  • Compare-and-swap

Remote CPU involved (messaging, two-sided)

  • Send
  • Recv

Performance? The one-sided option (fetch-and-add) reaches only ~2.2 M/s, as shown in the sketch and charts that follow.
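As a concrete sketch of the one-sided option for the sequencer, the client below obtains the next number with a fetch-and-add on the server's counter. The connected RC queue pair, completion queue, registered 8-byte buffer, and the counter's remote address/rkey are assumptions, not part of the slides.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Fetch the next sequence number with a one-sided fetch-and-add.
 * Assumes: qp is a connected RC queue pair, buf is an 8-byte registered
 * buffer (mr is its memory region), and remote_counter_addr/remote_rkey
 * describe the 8-byte-aligned server-side counter. */
static uint64_t next_seq(struct ibv_qp *qp, struct ibv_cq *cq,
                         void *buf, struct ibv_mr *mr,
                         uint64_t remote_counter_addr, uint32_t remote_rkey) {
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = 8, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_counter_addr;
    wr.wr.atomic.compare_add = 1;          /* add 1 to the remote counter */
    wr.wr.atomic.rkey = remote_rkey;

    if (ibv_post_send(qp, &wr, &bad)) return UINT64_MAX;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin until completion */ }
    if (wc.status != IBV_WC_SUCCESS) return UINT64_MAX;
    return *(uint64_t *)buf;   /* old counter value = our sequence number */
}
```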

SLIDE 6

How we sped up the sequencer by 50x

SLIDE 7

Large RDMA design space

Operations

  • READ, WRITE
  • ATOMIC
  • SEND, RECV

Optimizations

  • Inlined
  • Unsignaled
  • Doorbell batching
  • 0B-RECVs
  • WQE shrinking

Transports

  • Reliable / Unreliable
  • Connected / Datagram
  • Remote CPU bypass (one-sided) / Two-sided

SLIDE 8

Guidelines

NICs have multiple processing units (PUs)

  • Avoid contention
  • Exploit parallelism

PCI Express messages are expensive

  • Reduce CPU-to-NIC messages (MMIOs)
  • Reduce NIC-to-CPU messages (DMAs)

SLIDE 9

High contention w/ atomics

[Diagram: a Fetch&Add(A, 1) arriving at the NIC is handled by one of its processing units (PUs), which performs a DMA read and a DMA write of the sequence counter A over PCI Express; the server's cores and L3 are not involved.]

Latency: ~500 ns. Throughput: ~2 M/s.

SLIDE 10

Reduce contention: use CPU cores

[Diagram: the client's RPC request arrives as an RDMA write over PCI Express (~500 ns); a server core updates the counter A in L3 (core-to-L3 access: ~20 ns) and replies with a SEND (RPC response).]

[HERD, SIGCOMM 14]
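A rough sketch of the server side of this RPC-based sequencer, in the spirit of HERD; the request-slot layout, names, and per-client QPs are assumptions. Clients deposit requests with RDMA writes, a server core polls the slots, bumps the counter in its cache, and answers with a SEND.

```c
#include <infiniband/verbs.h>
#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint64_t counter;          /* shared sequence counter in L3 */

/* Per-core RPC loop (illustrative). req_slots[i] is written by client i
 * via a one-sided RDMA write; qps[i] is the QP used to answer client i;
 * resp_buf/resp_mr are registered response buffers. Completion and credit
 * management are omitted; real code would signal and poll periodically. */
void sequencer_core(volatile uint64_t *req_slots, int num_clients,
                    struct ibv_qp **qps, uint64_t *resp_buf,
                    struct ibv_mr *resp_mr) {
    for (;;) {
        for (int i = 0; i < num_clients; i++) {
            if (req_slots[i] == 0) continue;     /* no pending request */
            req_slots[i] = 0;

            /* ~20 ns in-cache add instead of a ~500 ns PCIe round trip. */
            resp_buf[i] = atomic_fetch_add(&counter, 1);

            struct ibv_sge sge = {
                .addr = (uintptr_t)&resp_buf[i], .length = 8,
                .lkey = resp_mr->lkey,
            };
            struct ibv_send_wr wr = {0}, *bad = NULL;
            wr.opcode = IBV_WR_SEND;
            wr.sg_list = &sge;
            wr.num_sge = 1;
            wr.send_flags = IBV_SEND_INLINE;     /* tiny payload: inline it */
            ibv_post_send(qps[i], &wr, &bad);
        }
    }
}
```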

SLIDE 11

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7.]

SLIDE 12

Reduce MMIOs w/ Doorbell batching

12

SEND

NIC CPU MMIOs ⇒ lots of CPU cycles

SEND SEND

NIC CPU

SEND

DMA Push Pull

SLIDE 13

RPCs w/ Doorbell batching

[Diagram: request/response flow between CPU and NIC for an RPC server; CPU-pushed descriptors (one MMIO per response) vs. NIC-pulled descriptors with Doorbell batching (one MMIO per batch).]
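In libibverbs terms, Doorbell batching means chaining several work requests through their next pointers and posting the chain with a single ibv_post_send call, so the driver rings the NIC's Doorbell once per batch. A minimal sketch, assuming a connected QP with inline support and registered response buffers (names are illustrative):

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Post a batch of n small SEND responses with one ibv_post_send call.
 * Chaining the work requests lets the driver ring the NIC's Doorbell once
 * for the whole batch instead of once per message (fewer MMIOs). */
int post_response_batch(struct ibv_qp *qp, struct ibv_mr *mr,
                        uint64_t *resp, int n) {
    enum { MAX_BATCH = 16 };
    struct ibv_sge sge[MAX_BATCH];
    struct ibv_send_wr wr[MAX_BATCH], *bad = NULL;
    if (n > MAX_BATCH) n = MAX_BATCH;

    for (int i = 0; i < n; i++) {
        sge[i].addr   = (uintptr_t)&resp[i];
        sge[i].length = sizeof(uint64_t);
        sge[i].lkey   = mr->lkey;

        wr[i] = (struct ibv_send_wr){0};
        wr[i].opcode  = IBV_WR_SEND;
        wr[i].sg_list = &sge[i];
        wr[i].num_sge = 1;
        wr[i].next    = (i + 1 < n) ? &wr[i + 1] : NULL;   /* chain WRs */
        /* Unsignaled except the last WR: fewer NIC-to-CPU completions. */
        wr[i].send_flags = IBV_SEND_INLINE |
                           ((i == n - 1) ? IBV_SEND_SIGNALED : 0);
    }
    /* One call, one Doorbell MMIO; the NIC DMAs the descriptors itself. */
    return ibv_post_send(qp, wr, &bad);
}
```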

SLIDE 14

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7; +Doorbell batching: 16.6.]

SLIDE 15

Exploit NIC parallelism w/ multiQ

[Diagram: with a single queue, one NIC processing unit handles every SEND (RPC response) and becomes the bottleneck while the other PUs sit idle; multiple queues spread the work across PUs.]
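A small sketch of the multi-queue idea (the QP setup and the choice of three QPs per worker mirror the "+3 queues" step but are otherwise assumptions): each worker rotates its sends across several queue pairs so that different NIC processing units can handle them in parallel.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Per-thread state: several QPs instead of one, so consecutive sends can
 * land on different NIC processing units. NUM_QPS is a tuning knob. */
enum { NUM_QPS = 3 };

struct worker {
    struct ibv_qp *qps[NUM_QPS];
    unsigned next_qp;              /* round-robin cursor */
};

/* Post one response, rotating across the worker's QPs. */
static int post_on_next_qp(struct worker *w, struct ibv_send_wr *wr) {
    struct ibv_send_wr *bad = NULL;
    struct ibv_qp *qp = w->qps[w->next_qp];
    w->next_qp = (w->next_qp + 1) % NUM_QPS;
    return ibv_post_send(qp, wr, &bad);
}
```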

SLIDE 16

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7; +Doorbell batching: 16.6; +3 queues: 27.4.]

SLIDE 17

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7; +Doorbell batching: 16.6; +3 queues: 27.4; +6 cores: 97.2.]

Bottleneck = PCIe DMA bandwidth (see paper)

SLIDE 18

Reduce DMA size: Header-only

[Diagram: a regular SEND makes the NIC DMA a 64B header plus a payload cacheline holding only a few useful bytes (e.g. 8B of data, the rest unused); a header-only SEND moves the small payload into the header's 4B immediate (Imm) field, so the NIC DMAs just the 64B header.]
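A minimal sketch of a header-only SEND over a connected QP (function names are illustrative): the work request carries no scatter/gather entry, the small value travels in the 4-byte immediate field, and the receiver only needs zero-byte RECVs.

```c
#include <infiniband/verbs.h>
#include <arpa/inet.h>   /* htonl / ntohl */
#include <stdint.h>

/* Sender: put a value of at most 4 bytes in the immediate field and attach
 * no payload, so the NIC only DMAs the 64B descriptor/header. */
static int send_header_only(struct ibv_qp *qp, uint32_t value) {
    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.opcode     = IBV_WR_SEND_WITH_IMM;
    wr.num_sge    = 0;                 /* no payload: header-only */
    wr.imm_data   = htonl(value);      /* delivered in the receiver's CQE */
    wr.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &wr, &bad);
}

/* Receiver: a zero-byte RECV suffices; the value arrives in wc.imm_data. */
static int post_zero_byte_recv(struct ibv_qp *qp, uint64_t wr_id) {
    struct ibv_recv_wr wr = {0}, *bad = NULL;
    wr.wr_id   = wr_id;
    wr.num_sge = 0;                    /* 0B-RECV */
    return ibv_post_recv(qp, &wr, &bad);
}

static uint32_t read_imm(const struct ibv_wc *wc) {
    /* Valid when wc->wc_flags & IBV_WC_WITH_IMM is set. */
    return ntohl(wc->imm_data);
}
```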

SLIDE 19

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7; +4 queues, Doorbell batching: 27.4; +6 cores: 97.2; +Header-only: 122, the ~50x overall speedup.]

SLIDE 20

Evaluation

  • Evaluation of optimizations on 3 RDMA generations
  • PCIe models, bottlenecks
  • More atomics experiments
  • Example: atomic operations on multiple addresses


SLIDE 21

RPC-based key-value store

[Chart: HERD [SIGCOMM 14] key-value store throughput in M/s vs. number of cores (2 to 14), Baseline vs. +Doorbell batching; 16B keys, 32B values, 5% PUTs; roughly 9 to 14 responses per Doorbell.]

SLIDE 22

Conclusion

Code: https://github.com/anujkaliaiitd/rdma_bench

NICs have multiple processing units (PUs)

  • Avoid contention
  • Exploit parallelism

PCI Express messages are expensive

  • Reduce CPU-to-NIC messages (MMIOs)
  • Reduce NIC-to-CPU messages (DMAs)