SLIDE 1

CompSci 514: Computer Networks Lecture 17: Network Support for Remote Direct Memory Access

Xiaowei Yang

Some slides adapted from http://www.cs.unh.edu/~rdr/rdma-intro-module.ppt

SLIDE 2

Overview

  • Introduction to RDMA
  • DCQCN: congestion control for large-scale RDMA deployments
  • Experience of deploying RDMA in a large-scale datacenter network


SLIDE 3

What is RDMA?

  • A (relatively) new method for high-speed inter-machine communication
    – new standards
    – new protocols
    – new hardware interface cards and switches
    – new software

SLIDE 4
SLIDE 5

Remote Direct Memory Access

  • Read, write, send, receive, etc. do not go through the CPU

SLIDE 6
  • Two machines (Intel Xeon E5-2660 2.2 GHz, 16 cores, 128 GB RAM, 40 Gbps NICs, Windows Server 2012 R2) connected via a 40 Gbps switch.

SLIDE 7
SLIDE 8
SLIDE 9

Remote Direct Memory Access

  • Remote
    – data transfers between nodes in a network
  • Direct
    – no Operating System kernel involvement in transfers
    – everything about a transfer offloaded onto the Interface Card
  • Memory
    – transfers between user-space application virtual memory
    – no extra copying or buffering
  • Access
    – send, receive, read, write, atomic operations

SLIDE 10

RDMA Benefits

  • High throughput
  • Low latency
  • High messaging rate
  • Low CPU utilization
  • Low memory bus contention
  • Message boundaries preserved
  • Asynchronous operation

SLIDE 11

RDMA Technologies

  • InfiniBand (41.8% of top-500 supercomputers)
    – SDR 4x – 8 Gbps
    – DDR 4x – 16 Gbps
    – QDR 4x – 32 Gbps
    – FDR 4x – 54 Gbps
  • iWARP – Internet Wide Area RDMA Protocol
    – 10 Gbps
  • RoCE – RDMA over Converged Ethernet
    – 10 Gbps
    – 40 Gbps

SLIDE 12

RDMA architecture layers

SLIDE 13

Software RDMA Drivers

  • SoftiWARP
    – www.zurich.ibm.com/sys/rdma
    – open-source kernel module that implements the iWARP protocols on top of ordinary kernel TCP sockets
    – interoperates with hardware iWARP at the other end of the wire
  • Soft RoCE
    – www.systemfabricworks.com/downloads/roce
    – open-source IB transport and network layers in software over ordinary Ethernet
    – interoperates with hardware RoCE at the other end of the wire

SLIDE 14

Similarities between TCP and RDMA

  • Both utilize the client-server model
  • Both require a connection for reliable transport
  • Both provide a reliable transport mode
    – TCP provides a reliable in-order sequence of bytes
    – RDMA provides a reliable in-order sequence of messages

SLIDE 15

How RDMA differs from TCP/IP

  • “Zero copy” – data is transferred directly from virtual memory on one node to virtual memory on another node
  • “Kernel bypass” – no operating system involvement during data transfers
  • Asynchronous operation – threads are not blocked during I/O transfers

SLIDE 16

TCP/IP setup

[Diagram: client and server stacks, each showing User App, Kernel Stack, CA, and Wire. The client calls connect(); the server calls bind(), listen(), and accept(). Blue lines: control information; red lines: user data; green lines: control and data.]

SLIDE 17

RDMA setup

[Diagram: client and server stacks, each showing User App, Kernel Stack, CA, and Wire. The client calls rdma_connect(); the server calls rdma_bind(), rdma_listen(), and rdma_accept(). Blue lines: control information; red lines: user data; green lines: control and data.]
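The call names in the diagram correspond to the librdmacm connection-manager API. Below is a minimal client-side sketch under stated assumptions: the server address 192.0.2.1:7471 is a placeholder, the id is used synchronously (NULL event channel), and error handling, protection-domain/queue-pair creation, and teardown are omitted. The server side would mirror this with rdma_bind_addr(), rdma_listen(), and rdma_accept().

    /* Minimal client-side RDMA connection setup using librdmacm (sketch only;
     * 192.0.2.1:7471 is a placeholder server; QP creation and error handling
     * are omitted). */
    #include <rdma/rdma_cma.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>

    int rdma_client_connect(void)
    {
        struct rdma_cm_id *id;
        struct sockaddr_in dst;
        struct rdma_conn_param param;

        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(7471);                     /* placeholder port   */
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* placeholder server */

        rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP);     /* NULL channel => synchronous calls */
        rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000); /* pick a local RDMA device */
        rdma_resolve_route(id, 2000);                     /* resolve a path to the server */

        /* ... allocate a protection domain and create a queue pair here
         *     (e.g., with rdma_create_qp()) before connecting ... */

        memset(&param, 0, sizeof(param));
        rdma_connect(id, &param);                         /* analogous to connect() */
        return 0;
    }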

SLIDE 18

TCP/IP setup

[Diagram: the TCP/IP connection-setup diagram repeated from Slide 16, shown again for comparison with the transfer slides that follow.]

SLIDE 19

TCP/IP transfer

[Diagram: after setup, one side issues send() and the other issues recv(). On each side the user data (red lines) is copied between application memory and kernel buffers before/after crossing the wire ("data copy" on both client and server). Blue lines: control information; green lines: control and data.]

SLIDE 20

RDMA transfer

[Diagram: after setup, one side issues rdma_post_send() and the other issues rdma_post_recv(). User data (red lines) moves directly between application virtual memory and the CA, with no kernel data copies. Blue lines: control information; green lines: control and data.]

SLIDE 21

“Normal” TCP/IP socket access model

  • Byte streams – requires the application to delimit / recover message boundaries (illustrated in the sketch below)
  • Synchronous – blocks until data is sent/received
    – O_NONBLOCK, MSG_DONTWAIT are not asynchronous; they are “try” and “try again”
  • send() and recv() are paired
    – both sides must participate in the transfer
  • Requires data copy into system buffers
    – order and timing of send() and recv() are irrelevant
    – user memory is accessible immediately before and immediately after each send() and recv() call
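As a point of reference for these properties, a minimal sketch of the conventional blocking socket model (standard POSIX calls; recv_exact() is a hypothetical helper name): because TCP delivers a byte stream, the application loops until it has read a whole message of known length, and the calling thread blocks while the kernel copies data out of its TCP buffers.

    /* Conventional blocking TCP receive (sketch; minimal error handling).
     * recv() may return fewer bytes than requested, so the application must
     * recover its own message boundaries by looping. */
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <stddef.h>

    ssize_t recv_exact(int sock, void *buf, size_t len)
    {
        size_t got = 0;
        while (got < len) {
            /* Blocks until the kernel has copied data from its TCP buffers
             * into buf; the calling thread can do nothing else meanwhile. */
            ssize_t n = recv(sock, (char *)buf + got, len - got, 0);
            if (n <= 0)
                return n;          /* 0 = connection closed, -1 = error */
            got += n;
        }
        return (ssize_t)got;
    }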

SLIDE 22

TCP recv()

[Diagram: timeline of a blocking TCP recv() across USER, OPERATING SYSTEM, NIC, and WIRE lanes. The user thread calls recv() and blocks; the kernel allocates TCP buffers and adds them to its tables, sleeps until data packets arrive (and are ACKed) via the NIC, copies the data from the TCP buffers into user virtual memory, then wakes the thread and returns status.]

SLIDE 23

RDMA recv()

[Diagram: timeline of an RDMA recv() across USER, CHANNEL ADAPTER, and WIRE lanes. The application registers its virtual memory, posts a receive work request to the recv queue, and continues with other work in parallel; the channel adapter places incoming data packets directly into the registered memory and ACKs them, then posts a completion; the application picks up the status by calling poll_cq() on the completion queue.]

SLIDE 24

RDMA access model

  • Messages – preserves user's message boundaries
  • Asynchronous – no blocking during a transfer (see the verbs sketch after this list), which
    – starts when metadata is added to the work queue
    – finishes when status is available in the completion queue
  • 1-sided (unpaired) and 2-sided (paired) transfers
  • No data copying into system buffers
    – order and timing of send() and recv() are relevant
      • recv() must be waiting before issuing send()
    – memory involved in the transfer is untouchable between start and completion of the transfer
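A minimal verbs-level sketch of this receive path, under stated assumptions: the protection domain pd, queue pair qp, and completion queue cq are assumed to have been created during setup, post_and_poll_recv() is a hypothetical helper name, and error handling is omitted.

    /* Verbs-level receive sketch (pd, qp, cq assumed to exist; error handling
     * omitted). The buffer is registered, a receive work request is posted
     * BEFORE the remote side sends, and completion is detected by polling
     * the completion queue. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    int post_and_poll_recv(struct ibv_pd *pd, struct ibv_qp *qp,
                           struct ibv_cq *cq, void *buf, size_t len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr  = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad;

        ibv_post_recv(qp, &wr, &bad);   /* hand the registered buffer to the NIC */

        /* The application may do other work here; the NIC writes the incoming
         * message directly into buf (no kernel involvement, no extra copy). */
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                           /* busy-poll for one completion */
        return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }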

SLIDE 25

Congestion Control for Large-Scale RDMA Deployments

By Yibo Zhu et al.

SLIDE 26

Problem

  • RDMA requires a lossless data link layer
  • Ethernet is not lossless
  • Solution → RDMA over Converged Ethernet (RoCE)

SLIDE 27

RoCE details

  • Priority-based Flow Control (PFC)
    – when busy, send PAUSE
    – when not busy, send RESUME (see the threshold sketch below)
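PFC itself lives in switch and NIC hardware; the following is only an illustrative sketch of the per-port, per-priority XOFF/XON threshold behavior. All names and thresholds here are hypothetical, not a real switch API or the IEEE 802.1Qbb frame format.

    /* Illustrative PFC behavior for one ingress queue (hypothetical names and
     * thresholds; real PFC is implemented in hardware). */
    #include <stdbool.h>

    #define XOFF_KB 96   /* occupancy at which we pause the upstream sender */
    #define XON_KB  32   /* occupancy at which we let it resume             */

    struct pfc_queue {
        unsigned int occupancy_kb;   /* bytes currently buffered for this priority */
        bool         paused;         /* have we sent PAUSE upstream?               */
    };

    static void send_pause_frame(void)  { /* emit a PAUSE frame upstream (stub)         */ }
    static void send_resume_frame(void) { /* PAUSE with zero quanta, i.e. resume (stub) */ }

    /* Called whenever the ingress queue occupancy changes. */
    void pfc_update(struct pfc_queue *q)
    {
        if (!q->paused && q->occupancy_kb >= XOFF_KB) {
            send_pause_frame();          /* queue is "busy": stop the whole upstream port */
            q->paused = true;
        } else if (q->paused && q->occupancy_kb <= XON_KB) {
            send_resume_frame();         /* queue has drained: allow traffic again */
            q->paused = false;
        }
    }

Note that the PAUSE stops everything arriving on that upstream port/priority, not just the offending flow; this is exactly the per-port behavior the next slides criticize.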

SLIDE 28
SLIDE 29

Problems with PFC

  • Per-port, not per-flow
  • Unfairness: port-fair, not flow-fair
  • Collateral damage: head-of-line blocking for some flows

SLIDE 30

Experimental topology

SLIDE 31

Unfairness

  • H1–H4 write to R
  • H4 has no contention at port P2
  • H1, H2, and H3 have contention on P3 and P4

SLIDE 32

Head of line blocking

  • VS → VR
  • H11–H14, H31–H32 → R
  • T4 is congested and sends PAUSE messages
  • T1 pauses all its incoming links regardless of their destinations

SLIDE 33

Solution

  • Per-flow congestion control
  • Existing work:
    – QCN (Quantized Congestion Notification)
      • uses Ethernet SRC/DST addresses and a flow ID to define a flow
      • the switch sends a congestion notification to the sender based on the source MAC address
      • only works at L2
  • This work: DCQCN
    – works for IP-routed networks

SLIDE 34

Why doesn't QCN work for IP networks?

  • The same packet carries different SRC/DST MAC addresses at each hop of an IP-routed network, so a switch cannot identify the flow's sender from the source MAC address.
SLIDE 35

DCQCN

  • DCQCN is a rate-based, end-to-end congestion control protocol
  • Most of the DCQCN functionality is implemented in the NICs

SLIDE 36

High-level ideas

  • ECN-mark packets at an egress queue
  • The receiver sends a Congestion Notification Packet (CNP) to the sender
  • The sender reduces its sending rate (see the sketch below)
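A minimal sketch of the sender-side (reaction-point) rate cut DCQCN applies when a CNP arrives, following the cutting rule described in the paper; the gain G and the starting rate are illustrative placeholders rather than the tuned values, and the rate-recovery phases are only noted in a comment.

    /* Sketch of the DCQCN sender (reaction point) rate decrease on receiving a
     * congestion notification; constants are illustrative, not tuned values. */
    #include <stdio.h>

    struct dcqcn_sender {
        double rc;     /* current sending rate (Gbps)                  */
        double rt;     /* target rate remembered for later recovery    */
        double alpha;  /* estimate of the fraction of marked packets   */
    };

    static const double G = 1.0 / 256.0;        /* gain for the alpha update */

    /* Called when a Congestion Notification Packet (CNP) arrives. */
    void dcqcn_on_cnp(struct dcqcn_sender *s)
    {
        s->rt = s->rc;                           /* remember rate before the cut */
        s->rc = s->rc * (1.0 - s->alpha / 2.0);  /* multiplicative decrease      */
        s->alpha = (1.0 - G) * s->alpha + G;     /* congestion estimate grows    */
    }

    /* Called when a full timer period passes with no CNP. */
    void dcqcn_on_quiet_period(struct dcqcn_sender *s)
    {
        s->alpha = (1.0 - G) * s->alpha;         /* congestion estimate decays   */
        /* Rate recovery (fast recovery / additive increase toward rt) omitted. */
    }

    int main(void)
    {
        struct dcqcn_sender s = { .rc = 40.0, .rt = 40.0, .alpha = 1.0 };
        dcqcn_on_cnp(&s);
        printf("rate after one CNP: %.1f Gbps (alpha=%.3f)\n", s.rc, s.alpha);
        return 0;
    }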
SLIDE 37

Challenges

  • How to set buffer sizes at the egress queue
  • How often to send congestion notifications
  • How a sender should reduce its sending rate to ensure both convergence and fairness

SLIDE 38

Solutions provided by the paper

  • ECN must be set before PFC is triggered
    – use the PFC queue sizes to set the ECN buffer threshold
  • Use a fluid model to tune the congestion control parameters

SLIDE 39

RDMA over Commodity Ethernet at Scale

Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, Marina Lipshteyn (Microsoft)

SLIDE 40

What this paper is about

  • Extending PFC to IP-routed networks
  • Safety issues of RDMA
    – livelock
    – deadlock
    – PFC pause frame storm
    – slow-receiver symptom
  • Performance observed in production networks
SLIDE 41
SLIDE 42

4 MB message, 1K packets; drop packets whose IP ID's last byte is 0xff (1/256 of packets)

SLIDE 43
SLIDE 44
SLIDE 45

S3 is dead. T1.p2 is congested. PAUSE is sent to T1.p3, La.p1, T0.p2, and S1.

SLIDE 46

S4 → S2, but S2 is dead. The blue packet is flooded to T0.p2, and T0.p2 is paused. Ingress T0.p3 pauses Lb.p0; Lb.p1 pauses T1.p4; T1.p1 pauses S4.

SLIDE 47
SLIDE 48

Summary

  • What is RDMA
  • DCQCN: congestion control for RDMA
  • Deployment issues for RDMA