Design Guidelines for High Performance RDMA Systems - Anuj Kalia - PowerPoint PPT Presentation



SLIDE 1

Design Guidelines for High Performance RDMA Systems

Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU)

SLIDE 2

RDMA is cheap (and fast!)

Mellanox Connect-IB

  • 2x 56 Gbps InfiniBand
  • ~2 µs RTT
  • RDMA
  • $1300

Problem: performance depends on complex low-level factors

SLIDE 3

Background: RDMA read

[Diagram: an RDMA read request arrives at the server NIC, which issues a DMA read over PCI Express to fetch the data from the server's memory/L3 and returns it in the RDMA read response; the server's CPU cores are not involved.]
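To map the diagram to code, here is a minimal libibverbs sketch of posting a one-sided RDMA read. The queue pair, completion queue, registered local buffer, and the remote address/rkey (normally exchanged out of band) are assumed to exist; all names are illustrative, not from the slides.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Issue a one-sided RDMA read: the remote CPU is not involved; the remote
 * NIC DMAs the data from the server's memory over PCIe and ships it back.
 * qp, cq, buf/mr, and remote_addr/rkey are assumed to be set up elsewhere. */
static int rdma_read(struct ibv_qp *qp, struct ibv_cq *cq,
                     void *buf, uint32_t len, struct ibv_mr *mr,
                     uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.opcode = IBV_WR_RDMA_READ;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;

    if (ibv_post_send(qp, &wr, &bad)) return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin until completion */ }
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```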

SLIDE 4

How to design a sequencer?

[Diagram: clients ask a central server for the next sequence number and receive consecutive values, e.g. 87 and 88.]

SLIDE 5

Which RDMA ops to use?


Remote CPU bypass (one-sided)

  • Read
  • Write
  • Fetch-and-add
  • Compare-and-swap

Remote CPU involved (messaging, two-sided)

  • Send
  • Recv

Performance? The one-sided option (fetch-and-add) reaches only ~2.2 M/s, as shown in the sketch and charts that follow.
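As a concrete sketch of the one-sided option for the sequencer, the client below obtains the next number with a fetch-and-add on the server's counter. The connected RC queue pair, completion queue, registered 8-byte buffer, and the counter's remote address/rkey are assumptions, not part of the slides.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Fetch the next sequence number with a one-sided fetch-and-add.
 * Assumes: qp is a connected RC queue pair, buf is an 8-byte registered
 * buffer (mr is its memory region), and remote_counter_addr/remote_rkey
 * describe the 8-byte-aligned server-side counter. */
static uint64_t next_seq(struct ibv_qp *qp, struct ibv_cq *cq,
                         void *buf, struct ibv_mr *mr,
                         uint64_t remote_counter_addr, uint32_t remote_rkey) {
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = 8, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_counter_addr;
    wr.wr.atomic.compare_add = 1;          /* add 1 to the remote counter */
    wr.wr.atomic.rkey = remote_rkey;

    if (ibv_post_send(qp, &wr, &bad)) return UINT64_MAX;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin until completion */ }
    if (wc.status != IBV_WC_SUCCESS) return UINT64_MAX;
    return *(uint64_t *)buf;   /* old counter value = our sequence number */
}
```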

SLIDE 6

How we sped up the sequencer by 50x

SLIDE 7

Large RDMA design space

Operations

  • READ, WRITE
  • ATOMIC
  • SEND, RECV

Optimizations

  • Inlined
  • Unsignaled
  • Doorbell batching
  • 0B-RECVs
  • WQE shrinking

Transports

  • Reliable / Unreliable
  • Connected / Datagram
  • Remote CPU bypass (one-sided) / Two-sided

SLIDE 8

Guidelines

NICs have multiple processing units (PUs)

  • Avoid contention
  • Exploit parallelism

PCI Express messages are expensive

  • Reduce CPU-to-NIC messages (MMIOs)
  • Reduce NIC-to-CPU messages (DMAs)

SLIDE 9

High contention w/ atomics

[Diagram: a Fetch&Add(A, 1) arriving at the NIC is handled by one of its processing units (PUs), which performs a DMA read and a DMA write of the sequence counter A over PCI Express; the server's cores and L3 are not involved.]

Latency: ~500 ns. Throughput: ~2 M/s.

SLIDE 10

Reduce contention: use CPU cores

[Diagram: the client's RPC request arrives as an RDMA write over PCI Express (~500 ns); a server core updates the counter A in L3 (core-to-L3 access: ~20 ns) and replies with a SEND (RPC response).]

[HERD, SIGCOMM 14]
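A rough sketch of the server side of this RPC-based sequencer, in the spirit of HERD; the request-slot layout, names, and per-client QPs are assumptions. Clients deposit requests with RDMA writes, a server core polls the slots, bumps the counter in its cache, and answers with a SEND.

```c
#include <infiniband/verbs.h>
#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint64_t counter;          /* shared sequence counter in L3 */

/* Per-core RPC loop (illustrative). req_slots[i] is written by client i
 * via a one-sided RDMA write; qps[i] is the QP used to answer client i;
 * resp_buf/resp_mr are registered response buffers. Completion and credit
 * management are omitted; real code would signal and poll periodically. */
void sequencer_core(volatile uint64_t *req_slots, int num_clients,
                    struct ibv_qp **qps, uint64_t *resp_buf,
                    struct ibv_mr *resp_mr) {
    for (;;) {
        for (int i = 0; i < num_clients; i++) {
            if (req_slots[i] == 0) continue;     /* no pending request */
            req_slots[i] = 0;

            /* ~20 ns in-cache add instead of a ~500 ns PCIe round trip. */
            resp_buf[i] = atomic_fetch_add(&counter, 1);

            struct ibv_sge sge = {
                .addr = (uintptr_t)&resp_buf[i], .length = 8,
                .lkey = resp_mr->lkey,
            };
            struct ibv_send_wr wr = {0}, *bad = NULL;
            wr.opcode = IBV_WR_SEND;
            wr.sg_list = &sge;
            wr.num_sge = 1;
            wr.send_flags = IBV_SEND_INLINE;     /* tiny payload: inline it */
            ibv_post_send(qps[i], &wr, &bad);
        }
    }
}
```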

SLIDE 11

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7.]

SLIDE 12

Reduce MMIOs w/ Doorbell batching

12

SEND

NIC CPU MMIOs ⇒ lots of CPU cycles

SEND SEND

NIC CPU

SEND

DMA Push Pull

SLIDE 13

RPCs w/ Doorbell batching

[Diagram: request/response flow between CPU and NIC for an RPC server; CPU-pushed descriptors (one MMIO per response) vs. NIC-pulled descriptors with Doorbell batching (one MMIO per batch).]
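In libibverbs terms, Doorbell batching means chaining several work requests through their next pointers and posting the chain with a single ibv_post_send call, so the driver rings the NIC's Doorbell once per batch. A minimal sketch, assuming a connected QP with inline support and registered response buffers (names are illustrative):

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Post a batch of n small SEND responses with one ibv_post_send call.
 * Chaining the work requests lets the driver ring the NIC's Doorbell once
 * for the whole batch instead of once per message (fewer MMIOs). */
int post_response_batch(struct ibv_qp *qp, struct ibv_mr *mr,
                        uint64_t *resp, int n) {
    enum { MAX_BATCH = 16 };
    struct ibv_sge sge[MAX_BATCH];
    struct ibv_send_wr wr[MAX_BATCH], *bad = NULL;
    if (n > MAX_BATCH) n = MAX_BATCH;

    for (int i = 0; i < n; i++) {
        sge[i].addr   = (uintptr_t)&resp[i];
        sge[i].length = sizeof(uint64_t);
        sge[i].lkey   = mr->lkey;

        wr[i] = (struct ibv_send_wr){0};
        wr[i].opcode  = IBV_WR_SEND;
        wr[i].sg_list = &sge[i];
        wr[i].num_sge = 1;
        wr[i].next    = (i + 1 < n) ? &wr[i + 1] : NULL;   /* chain WRs */
        /* Unsignaled except the last WR: fewer NIC-to-CPU completions. */
        wr[i].send_flags = IBV_SEND_INLINE |
                           ((i == n - 1) ? IBV_SEND_SIGNALED : 0);
    }
    /* One call, one Doorbell MMIO; the NIC DMAs the descriptors itself. */
    return ibv_post_send(qp, wr, &bad);
}
```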

SLIDE 14

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7; +Doorbell batching: 16.6.]

SLIDE 15

Exploit NIC parallelism w/ multiQ

[Diagram: with a single queue, one NIC processing unit handles every SEND (RPC response) and becomes the bottleneck while the other PUs sit idle; multiple queues spread the work across PUs.]
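A small sketch of the multi-queue idea (the QP setup and the choice of three QPs per worker mirror the "+3 queues" step but are otherwise assumptions): each worker rotates its sends across several queue pairs so that different NIC processing units can handle them in parallel.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Per-thread state: several QPs instead of one, so consecutive sends can
 * land on different NIC processing units. NUM_QPS is a tuning knob. */
enum { NUM_QPS = 3 };

struct worker {
    struct ibv_qp *qps[NUM_QPS];
    unsigned next_qp;              /* round-robin cursor */
};

/* Post one response, rotating across the worker's QPs. */
static int post_on_next_qp(struct worker *w, struct ibv_send_wr *wr) {
    struct ibv_send_wr *bad = NULL;
    struct ibv_qp *qp = w->qps[w->next_qp];
    w->next_qp = (w->next_qp + 1) % NUM_QPS;
    return ibv_post_send(qp, wr, &bad);
}
```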

SLIDE 16

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7; +Doorbell batching: 16.6; +3 queues: 27.4.]

SLIDE 17

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7; +Doorbell batching: 16.6; +3 queues: 27.4; +6 cores: 97.2.]

Bottleneck = PCIe DMA bandwidth (see paper)

SLIDE 18

Reduce DMA size: Header-only

[Diagram: a regular SEND makes the NIC DMA a 64B header plus a payload cacheline holding only a few useful bytes (e.g. 8B of data, the rest unused); a header-only SEND moves the small payload into the header's 4B immediate (Imm) field, so the NIC DMAs just the 64B header.]
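A minimal sketch of a header-only SEND over a connected QP (function names are illustrative): the work request carries no scatter/gather entry, the small value travels in the 4-byte immediate field, and the receiver only needs zero-byte RECVs.

```c
#include <infiniband/verbs.h>
#include <arpa/inet.h>   /* htonl / ntohl */
#include <stdint.h>

/* Sender: put a value of at most 4 bytes in the immediate field and attach
 * no payload, so the NIC only DMAs the 64B descriptor/header. */
static int send_header_only(struct ibv_qp *qp, uint32_t value) {
    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.opcode     = IBV_WR_SEND_WITH_IMM;
    wr.num_sge    = 0;                 /* no payload: header-only */
    wr.imm_data   = htonl(value);      /* delivered in the receiver's CQE */
    wr.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &wr, &bad);
}

/* Receiver: a zero-byte RECV suffices; the value arrives in wc.imm_data. */
static int post_zero_byte_recv(struct ibv_qp *qp, uint64_t wr_id) {
    struct ibv_recv_wr wr = {0}, *bad = NULL;
    wr.wr_id   = wr_id;
    wr.num_sge = 0;                    /* 0B-RECV */
    return ibv_post_recv(qp, &wr, &bad);
}

static uint32_t read_imm(const struct ibv_wc *wc) {
    /* Valid when wc->wc_flags & IBV_WC_WITH_IMM is set. */
    return ntohl(wc->imm_data);
}
```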

SLIDE 19

[Chart: sequencer throughput in M/s. Atomics: 2.2; RPC (1 core): 7; +4 queues, Doorbell batching: 27.4; +6 cores: 97.2; +Header-only: 122, the ~50x overall speedup.]

SLIDE 20

Evaluation

  • Evaluation of optimizations on 3 RDMA generations
  • PCIe models, bottlenecks
  • More atomics experiments
  • Example: atomic operations on multiple addresses


SLIDE 21

RPC-based key-value store

[Chart: HERD [SIGCOMM 14] key-value store throughput in M/s vs. number of cores (2 to 14), Baseline vs. +Doorbell batching; 16B keys, 32B values, 5% PUTs; roughly 9 to 14 responses per Doorbell.]

SLIDE 22

Conclusion

Code: https://github.com/anujkaliaiitd/rdma_bench

NICs have multiple processing units (PUs)

  • Avoid contention
  • Exploit parallelism

PCI Express messages are expensive

  • Reduce CPU-to-NIC messages (MMIOs)
  • Reduce NIC-to-CPU messages (DMAs)