CompSci 514: Computer Networks
Lecture 17: Network Support for Remote Direct Memory Access
Xiaowei Yang
Some slides adapted from http://www.cs.unh.edu/~rdr/rdma-intro-module.ppt
Overview
- Introduction to RDMA
- DCQCN: congestion control for large-scale RDMA deployments
- Experience of deploying RDMA in a large-scale datacenter network
What is RDMA?
- A (relatively) new method for high-speed inter-machine communication
  – new standards
  – new protocols
  – new hardware interface cards and switches
  – new software
Remote Direct Memory Access
- Read, write, send, receive, etc. do not go through the CPU
- Test setup: two machines (Intel Xeon E5-2660 2.2 GHz, 16 cores, 128 GB RAM, 40 Gbps NICs, Windows Server 2012 R2) connected via a 40 Gbps switch
Remote Direct Memory Access
- Remote
  – data transfers between nodes in a network
- Direct
  – no Operating System kernel involvement in transfers
  – everything about a transfer offloaded onto the interface card
- Memory
  – transfers between user-space application virtual memory
  – no extra copying or buffering
- Access
  – send, receive, read, write, atomic operations
RDMA Benefits
- High throughput
- Low latency
- High messaging rate
- Low CPU utilization
- Low memory bus contention
- Message boundaries preserved
- Asynchronous operation
RDMA Technologies
- InfiniBand (41.8% of top 500 supercomputers)
  – SDR 4x: 8 Gbps
  – DDR 4x: 16 Gbps
  – QDR 4x: 32 Gbps
  – FDR 4x: 54 Gbps
- iWARP: Internet Wide Area RDMA Protocol
  – 10 Gbps
- RoCE: RDMA over Converged Ethernet
  – 10 Gbps
  – 40 Gbps
RDMA architecture layers
Software RDMA Drivers
- Softiwarp
  – www.zurich.ibm.com/sys/rdma
  – open-source kernel module that implements the iWARP protocols on top of ordinary kernel TCP sockets
  – interoperates with hardware iWARP at the other end of the wire
- Soft RoCE
  – www.systemfabricworks.com/downloads/roce
  – open-source IB transport and network layers in software over ordinary Ethernet
  – interoperates with hardware RoCE at the other end of the wire
Similarities between TCP and RDMA
- Both utilize the client-server model
- Both require a connection for reliable transport
- Both provide a reliable transport mode
  – TCP provides a reliable in-order sequence of bytes
  – RDMA provides a reliable in-order sequence of messages
How RDMA differs from TCP/IP
- “zero copy”: data is transferred directly from virtual memory on one node to virtual memory on another node
- “kernel bypass”: no operating system involvement during data transfers
- asynchronous operation: threads are not blocked during I/O transfers
TCP/IP setup
[Figure: client and server stacks, each showing User App / Kernel Stack / CA / Wire. The client calls connect(); the server calls bind(), listen(), and accept(). Blue lines: control information; red lines: user data; green lines: control and data.]
RDMA setup
[Figure: the same client and server stacks (User App / Kernel Stack / CA / Wire). The client calls rdma_connect(); the server calls rdma_bind(), rdma_listen(), and rdma_accept(). Blue lines: control information; red lines: user data; green lines: control and data.]
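To make the setup concrete, here is a minimal sketch of the client side in C using the librdmacm API that the figure's call names correspond to. The server name "server.example" and port "7471" are placeholders and error handling is omitted; this is a sketch under those assumptions, not a complete program.

```c
/* Minimal librdmacm client setup sketch (link with -lrdmacm -libverbs).
 * Hostname/port are placeholders; all error checks omitted for brevity. */
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>
#include <string.h>

int main(void)
{
    struct rdma_addrinfo hints, *res;
    struct ibv_qp_init_attr attr;
    struct rdma_cm_id *id;

    memset(&hints, 0, sizeof hints);
    hints.ai_port_space = RDMA_PS_TCP;      /* reliable connected service */
    rdma_getaddrinfo("server.example", "7471", &hints, &res);

    memset(&attr, 0, sizeof attr);
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 1;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;                    /* every send yields a completion */

    rdma_create_ep(&id, res, NULL, &attr);  /* create cm_id + queue pair */
    rdma_connect(id, NULL);                 /* the rdma_connect of the figure */

    /* ... post work requests and poll completions (see transfer sketch) ... */

    rdma_disconnect(id);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}
```

The server side mirrors this with rdma_bind_addr(), rdma_listen(), and rdma_accept() (or rdma_create_ep() with a passive hint followed by rdma_get_request()).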
TCP/IP transfer
[Figure: on the established connection, the client calls send() and the server calls recv(). On each side the data is copied between user memory and kernel buffers before the kernel stacks move it across the wire. Blue lines: control information; red lines: user data; green lines: control and data.]
RDMA transfer
[Figure: on the established connection, the client calls rdma_post_send() and the server calls rdma_post_recv(). The channel adapters move the data directly between the two applications' virtual memories, with no kernel involvement and no intermediate copies. Blue lines: control information; red lines: user data; green lines: control and data.]
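The transfer in the figure maps onto a pair of verbs calls. Below is a hedged sketch assuming `id` is the connected rdma_cm_id from the setup sketch above; in real code the receive would be posted before the connection is established, so it cannot lose a race with the peer's send.

```c
/* Two-sided send/recv sketch over a connected rdma_cm_id.
 * Buffer size is arbitrary; error checks omitted. */
#include <rdma/rdma_verbs.h>

#define MSG_LEN 4096
static char buf[MSG_LEN];

void transfer(struct rdma_cm_id *id, int is_server)
{
    struct ibv_mr *mr = rdma_reg_msgs(id, buf, MSG_LEN); /* register (pin) memory */
    struct ibv_wc wc;

    if (is_server) {
        rdma_post_recv(id, NULL, buf, MSG_LEN, mr);   /* rdma_post_recv in figure */
        rdma_get_recv_comp(id, &wc);   /* wait for completion; buf now holds the
                                          message, placed by the CA, zero-copy */
    } else {
        rdma_post_send(id, NULL, buf, MSG_LEN, mr, 0); /* rdma_post_send in figure */
        rdma_get_send_comp(id, &wc);   /* buf must stay untouched until this returns */
    }
    rdma_dereg_mr(mr);
}
```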
“Normal” TCP/IP socket access model
- Byte streams
  – requires the application to delimit / recover message boundaries (see the sketch after this list)
- Synchronous
  – blocks until data is sent/received
  – O_NONBLOCK and MSG_DONTWAIT are not asynchronous; they are “try” and “try again”
- send() and recv() are paired
  – both sides must participate in the transfer
- Requires data copies into system buffers
  – order and timing of send() and recv() are irrelevant
  – user memory is accessible immediately before and immediately after each send() and recv() call
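By contrast with RDMA, a sketch of the blocking byte-stream model just listed: because TCP discards message boundaries and recv() may return fewer bytes than requested, the application loops, and each call copies data out of kernel buffers.

```c
/* Blocking receive loop over a connected TCP socket fd. */
#include <sys/socket.h>
#include <sys/types.h>

/* Read exactly len bytes; TCP is a byte stream, so one recv() is not enough. */
ssize_t recv_full(int fd, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0); /* blocks until data */
        if (n <= 0)
            return n;      /* error, or peer closed the connection */
        got += n;          /* the kernel copied n bytes into our buffer */
    }
    return (ssize_t)got;
}
```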
TCP recv()
[Figure: the user calls recv() and blocks. The operating system allocates virtual memory, adds it to its tables, and sleeps until data packets arrive at the NIC and are ACKed into TCP buffers; it then copies the data into the user buffer and wakes the caller with a status. Columns: USER, OPERATING SYSTEM, NIC, WIRE.]
RDMA recv()
[Figure: the user allocates and registers virtual memory, posts a recv() work request to the recv queue, and continues with parallel activity. The channel adapter places arriving data packets directly into the registered memory and ACKs them; the user later collects the status from the completion queue with poll_cq(). Columns: USER, CHANNEL ADAPTER, WIRE.]
RDMA access model
- Messages
  – preserves the user's message boundaries
- Asynchronous
  – no blocking during a transfer, which starts when metadata is added to the work queue and finishes when status is available in the completion queue
- 1-sided (unpaired) and 2-sided (paired) transfers (see the sketch below)
- No data copying into system buffers
  – order and timing of send() and recv() are relevant: recv() must be posted before the matching send() is issued
  – memory involved in a transfer is untouchable between the start and completion of the transfer
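A sketch of a 1-sided (unpaired) transfer under this model: an RDMA WRITE that deposits data into remote memory with no recv() posted and no CPU involvement at the target. The remote address and rkey are assumed to have been exchanged earlier (for example via a 2-sided message); every parameter here is a placeholder.

```c
/* One-sided RDMA WRITE sketch; remote_addr/rkey obtained out of band. */
#include <rdma/rdma_verbs.h>

void one_sided_write(struct rdma_cm_id *id, struct ibv_mr *mr,
                     void *local_buf, size_t len,
                     uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_wc wc;

    rdma_post_write(id, NULL, local_buf, len, mr, 0, remote_addr, rkey);
    rdma_get_send_comp(id, &wc);   /* local completion only; the target
                                      application never participates */
}
```

For this to succeed, the target's memory region must have been registered with remote-write permission (IBV_ACCESS_REMOTE_WRITE).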
Congestion Control for Large-Scale RDMA Deployments
By Yibo Zhu et al.
Problem
- RDMA requires a lossless data link layer
- Ethernet is not lossless
- Solution → RDMA over Converged Ethernet (RoCE)
RoCE details
- Priority-based Flow Control (PFC), sketched below
  – When busy, send PAUSE
  – When not busy, send RESUME
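PFC runs in switch and NIC hardware; the following is only a conceptual sketch of the per-priority watermark behavior described above, with made-up thresholds.

```c
/* Conceptual PFC watermark logic (not a real switch API). */
#include <stdbool.h>
#include <stdint.h>

#define XOFF_BYTES (96 * 1024)  /* send PAUSE above this (illustrative) */
#define XON_BYTES  (64 * 1024)  /* send RESUME below this (illustrative) */

struct pfc_queue { uint32_t bytes; bool paused; };

void pfc_update(struct pfc_queue *q,
                void (*send_pause)(void), void (*send_resume)(void))
{
    if (!q->paused && q->bytes > XOFF_BYTES) {
        send_pause();            /* upstream stops this priority only */
        q->paused = true;
    } else if (q->paused && q->bytes < XON_BYTES) {
        send_resume();           /* upstream may resume transmitting */
        q->paused = false;
    }
}
```

Because the pause applies to everything arriving on the port (per priority class), not to individual flows, it creates the fairness and head-of-line problems on the next slides.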
Problems with PFC
- Per-port, not per-flow
- Unfairness: port-fair, not flow-fair
- Collateral damage: head-of-line blocking for some flows
Experimental topology
Unfairness
- H1–H4 write to R
- H4 has no contention at port P2
- H1, H2, and H3 have contention on P3 and P4
Head-of-line blocking
- VS → VR
- H11–H14, H31–H32 → R
- T4 is congested and sends PAUSE messages
- T1 pauses all its incoming links regardless of their destinations
Solution
- Per-flow congestion control
- Existing work: QCN (Quantized Congestion Notification)
  – Uses the Ethernet SRC/DST addresses and a flow ID to define a flow
  – The switch sends a congestion notification to the sender based on the source MAC address
  – Only works at L2
- This work: DCQCN
  – Works for IP-routed networks
Why does QCN not work for IP networks?
- The same packet carries different SRC/DST MAC addresses on each IP hop, so a switch cannot map a marked packet back to its end-to-end sender.
DCQCN
- DCQCN is a rate-based, end-to-end congestion control protocol
- Most of the DCQCN functionality is implemented in the NICs
High-level ideas
- Switches ECN-mark packets at an egress queue
- The receiver sends a Congestion Notification Packet (CNP) to the sender
- The sender reduces its sending rate (see the sketch below)
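The sender-side (reaction-point) rules can be sketched as follows. The update equations follow the DCQCN paper; the constants are illustrative, the paper runs the α-decay and rate-increase logic on separate timers (merged here for brevity), and the real implementation lives in the NIC, not host software.

```c
/* DCQCN reaction-point sketch: rc = current rate, rt = target rate,
 * alpha = running estimate of congestion extent. Constants illustrative. */
struct dcqcn_flow { double rc, rt, alpha; };

#define G    (1.0 / 256.0)  /* gain for the alpha moving average */
#define R_AI 40.0           /* additive-increase step (Mbps) */

/* CNP received: the receiver saw ECN marks, so cut multiplicatively. */
void on_cnp(struct dcqcn_flow *f)
{
    f->rt = f->rc;                          /* remember rate before the cut */
    f->rc *= 1.0 - f->alpha / 2.0;          /* cut in proportion to alpha */
    f->alpha = (1.0 - G) * f->alpha + G;    /* congestion persists: raise alpha */
}

/* Timer expired with no CNP: decay alpha and recover toward rt. */
void on_quiet_timer(struct dcqcn_flow *f)
{
    f->alpha = (1.0 - G) * f->alpha;        /* congestion abating */
    f->rc = (f->rt + f->rc) / 2.0;          /* fast recovery */
}

/* After enough quiet rounds, probe beyond the old rate. */
void additive_increase(struct dcqcn_flow *f)
{
    f->rt += R_AI;
    f->rc = (f->rt + f->rc) / 2.0;
}
```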
Challenges
- How to set buffer sizes at the egress queue
- How often to send congestion notifications
- How a sender should reduce its sending rate to ensure both convergence and fairness
Solutions provided by the paper
- ECN must be set before PFC is triggered
  – Use the PFC queue thresholds to set the ECN marking thresholds
- Use a fluid model to tune the congestion control parameters (see the marking sketch below)
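A sketch of the RED-like ECN marking assumed at the switch egress queue: nothing is marked below Kmin, packets are marked with probability rising linearly to Pmax at Kmax, and everything is marked above Kmax. The values below are illustrative; the paper's point is to pick the thresholds (and, via the fluid model, the rate-control parameters) so that marking reliably fires before the PFC pause threshold is reached.

```c
/* RED-like ECN marking decision for one egress queue (illustrative values). */
#include <stdlib.h>

#define KMIN 5.0      /* KB: below this, never mark */
#define KMAX 200.0    /* KB: above this, always mark */
#define PMAX 0.01     /* marking probability at KMAX */

int should_mark_ecn(double qlen_kb)
{
    if (qlen_kb <= KMIN)
        return 0;
    if (qlen_kb >= KMAX)
        return 1;
    double p = PMAX * (qlen_kb - KMIN) / (KMAX - KMIN);
    return ((double)rand() / RAND_MAX) < p;   /* mark with probability p */
}
```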
RDMA over Commodity Ethernet at Scale
Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, Marina Lipshteyn (Microsoft)
What this paper is about
- Extending PFC to IP-routed networks
- Safety issues of RDMA
  – Livelock
  – Deadlock
  – Pause frame storm
  – Slow-receiver symptom
- Performance observed in production networks
[Figure: RDMA transport livelock] A 4 MB message is sent as 1,000 packets; the testbed drops any packet whose IP ID's last byte is 0xff (a 1/256 drop rate).
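A back-of-envelope calculation shows why this is fatal with the go-back-0 retransmission the paper found in the NICs (any loss restarts the whole message):

\[
\Pr[\text{all 1000 packets survive}] = \left(1 - \tfrac{1}{256}\right)^{1000} \approx e^{-1000/256} \approx 0.02,
\]

so only about one attempt in fifty delivers the complete message, and every failed attempt wastes the bytes already sent; goodput collapses to near zero. Switching the NICs to go-back-N retransmission removes the livelock.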
[Figure: PFC pause frame storm] S3 is dead; T1.p2 becomes congested, so PAUSE frames are sent to T1.p3, La.p1, T0.p2, and S1.
[Figure: PFC deadlock] S4 → S2, but S2 is dead, so the blue packets are flooded to T0.p2. T0.p2 is paused; ingress T0.p3 pauses Lb.p0; Lb.p1 pauses T1.p4; T1.p1 pauses S4, closing a cycle of pauses.
Summary
- What is RDMA
- DCQCN: congestion control for RDMA
- Deployment issues for RDMA