CompSci 514: Computer Networks
Lecture 17: Network Support for Remote Direct Memory Access
Xiaowei Yang
Some slides adapted from http://www.cs.unh.edu/~rdr/rdma-intro-module.ppt
Overview
- Introduction to RDMA
- DCQCN: congestion control for large-scale RDMA deployments
- Experience of deploying RDMA in a large-scale datacenter network
What is RDMA?
- A (relatively) new method for high-speed inter-machine communication
  – new standards
  – new protocols
  – new hardware interface cards and switches
  – new software
Remote Direct Memory Access
- Read, write, send, receive, etc. do not go through the CPU
- Test setup: two machines (Intel Xeon E5-2660 2.2 GHz, 16 cores, 128 GB RAM, 40 Gbps NICs, Windows Server 2012 R2) connected via a 40 Gbps switch
Remote Direct Memory Access
- Remote
  – data transfers between nodes in a network
- Direct
  – no Operating System kernel involvement in transfers
  – everything about a transfer offloaded onto the interface card
- Memory
  – transfers between user-space application virtual memory
  – no extra copying or buffering
- Access
  – send, receive, read, write, atomic operations
RDMA Benefits
- High throughput
- Low latency
- High messaging rate
- Low CPU utilization
- Low memory bus contention
- Message boundaries preserved
- Asynchronous operation
RDMA Technologies
- InfiniBand (41.8% of top 500 supercomputers)
  – SDR 4x: 8 Gbps
  – DDR 4x: 16 Gbps
  – QDR 4x: 32 Gbps
  – FDR 4x: 54 Gbps
- iWARP: Internet Wide Area RDMA Protocol
  – 10 Gbps
- RoCE: RDMA over Converged Ethernet
  – 10 Gbps
  – 40 Gbps
RDMA architecture layers
Software RDMA Drivers
- Softiwarp
  – www.zurich.ibm.com/sys/rdma
  – open-source kernel module that implements the iWARP protocols on top of ordinary kernel TCP sockets
  – interoperates with hardware iWARP at the other end of the wire
- Soft RoCE
  – www.systemfabricworks.com/downloads/roce
  – open-source IB transport and network layers in software over ordinary Ethernet
  – interoperates with hardware RoCE at the other end of the wire
Similarities between TCP and RDMA
- Both utilize the client-server model
- Both require a connection for reliable transport
- Both provide a reliable transport mode
  – TCP provides a reliable in-order sequence of bytes
  – RDMA provides a reliable in-order sequence of messages
How RDMA differs from TCP/IP
- “zero copy”: data is transferred directly from virtual memory on one node to virtual memory on another node
- “kernel bypass”: no operating system involvement during data transfers
- asynchronous operation: threads are not blocked during I/O transfers
TCP/IP setup
[Figure: client and server stacks, each showing User App / Kernel Stack / CA / Wire. The client calls connect(); the server calls bind(), listen(), and accept(). Blue lines: control information; red lines: user data; green lines: control and data.]
RDMA setup
[Figure: the same client and server stacks (User App / Kernel Stack / CA / Wire). The client calls rdma_connect(); the server calls rdma_bind(), rdma_listen(), and rdma_accept(). Blue lines: control information; red lines: user data; green lines: control and data.]
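To make the setup concrete, here is a minimal sketch of the client side in C using the librdmacm API that the figure's call names correspond to. The server name "server.example" and port "7471" are placeholders and error handling is omitted; this is a sketch under those assumptions, not a complete program.

```c
/* Minimal librdmacm client setup sketch (link with -lrdmacm -libverbs).
 * Hostname/port are placeholders; all error checks omitted for brevity. */
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>
#include <string.h>

int main(void)
{
    struct rdma_addrinfo hints, *res;
    struct ibv_qp_init_attr attr;
    struct rdma_cm_id *id;

    memset(&hints, 0, sizeof hints);
    hints.ai_port_space = RDMA_PS_TCP;      /* reliable connected service */
    rdma_getaddrinfo("server.example", "7471", &hints, &res);

    memset(&attr, 0, sizeof attr);
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 1;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;                    /* every send yields a completion */

    rdma_create_ep(&id, res, NULL, &attr);  /* create cm_id + queue pair */
    rdma_connect(id, NULL);                 /* the rdma_connect of the figure */

    /* ... post work requests and poll completions (see transfer sketch) ... */

    rdma_disconnect(id);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}
```

The server side mirrors this with rdma_bind_addr(), rdma_listen(), and rdma_accept() (or rdma_create_ep() with a passive hint followed by rdma_get_request()).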
TCP/IP transfer
[Figure: on the established connection, the client calls send() and the server calls recv(). On each side the data is copied between user memory and kernel buffers before the kernel stacks move it across the wire. Blue lines: control information; red lines: user data; green lines: control and data.]
RDMA transfer
[Figure: on the established connection, the client calls rdma_post_send() and the server calls rdma_post_recv(). The channel adapters move the data directly between the two applications' virtual memories, with no kernel involvement and no intermediate copies. Blue lines: control information; red lines: user data; green lines: control and data.]
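The transfer in the figure maps onto a pair of verbs calls. Below is a hedged sketch assuming `id` is the connected rdma_cm_id from the setup sketch above; in real code the receive would be posted before the connection is established, so it cannot lose a race with the peer's send.

```c
/* Two-sided send/recv sketch over a connected rdma_cm_id.
 * Buffer size is arbitrary; error checks omitted. */
#include <rdma/rdma_verbs.h>

#define MSG_LEN 4096
static char buf[MSG_LEN];

void transfer(struct rdma_cm_id *id, int is_server)
{
    struct ibv_mr *mr = rdma_reg_msgs(id, buf, MSG_LEN); /* register (pin) memory */
    struct ibv_wc wc;

    if (is_server) {
        rdma_post_recv(id, NULL, buf, MSG_LEN, mr);   /* rdma_post_recv in figure */
        rdma_get_recv_comp(id, &wc);   /* wait for completion; buf now holds the
                                          message, placed by the CA, zero-copy */
    } else {
        rdma_post_send(id, NULL, buf, MSG_LEN, mr, 0); /* rdma_post_send in figure */
        rdma_get_send_comp(id, &wc);   /* buf must stay untouched until this returns */
    }
    rdma_dereg_mr(mr);
}
```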
“Normal” TCP/IP socket access model
- Byte streams
  – requires the application to delimit / recover message boundaries (see the sketch after this list)
- Synchronous
  – blocks until data is sent/received
  – O_NONBLOCK and MSG_DONTWAIT are not asynchronous; they are “try” and “try again”
- send() and recv() are paired
  – both sides must participate in the transfer
- Requires data copies into system buffers
  – order and timing of send() and recv() are irrelevant
  – user memory is accessible immediately before and immediately after each send() and recv() call
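By contrast with RDMA, a sketch of the blocking byte-stream model just listed: because TCP discards message boundaries and recv() may return fewer bytes than requested, the application loops, and each call copies data out of kernel buffers.

```c
/* Blocking receive loop over a connected TCP socket fd. */
#include <sys/socket.h>
#include <sys/types.h>

/* Read exactly len bytes; TCP is a byte stream, so one recv() is not enough. */
ssize_t recv_full(int fd, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0); /* blocks until data */
        if (n <= 0)
            return n;      /* error, or peer closed the connection */
        got += n;          /* the kernel copied n bytes into our buffer */
    }
    return (ssize_t)got;
}
```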
TCP recv()
[Figure: the user calls recv() and blocks. The operating system allocates virtual memory, adds it to its tables, and sleeps until data packets arrive at the NIC and are ACKed into TCP buffers; it then copies the data into the user buffer and wakes the caller with a status. Columns: USER, OPERATING SYSTEM, NIC, WIRE.]
RDMA recv()
[Figure: the user allocates and registers virtual memory, posts a recv() work request to the recv queue, and continues with parallel activity. The channel adapter places arriving data packets directly into the registered memory and ACKs them; the user later collects the status from the completion queue with poll_cq(). Columns: USER, CHANNEL ADAPTER, WIRE.]
RDMA access model
- Messages
  – preserves the user's message boundaries
- Asynchronous
  – no blocking during a transfer, which starts when metadata is added to the work queue and finishes when status is available in the completion queue
- 1-sided (unpaired) and 2-sided (paired) transfers (see the sketch below)
- No data copying into system buffers
  – order and timing of send() and recv() are relevant: recv() must be posted before the matching send() is issued
  – memory involved in a transfer is untouchable between the start and completion of the transfer
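A sketch of a 1-sided (unpaired) transfer under this model: an RDMA WRITE that deposits data into remote memory with no recv() posted and no CPU involvement at the target. The remote address and rkey are assumed to have been exchanged earlier (for example via a 2-sided message); every parameter here is a placeholder.

```c
/* One-sided RDMA WRITE sketch; remote_addr/rkey obtained out of band. */
#include <rdma/rdma_verbs.h>

void one_sided_write(struct rdma_cm_id *id, struct ibv_mr *mr,
                     void *local_buf, size_t len,
                     uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_wc wc;

    rdma_post_write(id, NULL, local_buf, len, mr, 0, remote_addr, rkey);
    rdma_get_send_comp(id, &wc);   /* local completion only; the target
                                      application never participates */
}
```

For this to succeed, the target's memory region must have been registered with remote-write permission (IBV_ACCESS_REMOTE_WRITE).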
Congestion Control for Large-Scale RDMA Deployments
By Yibo Zhu et al.
Problem
- RDMA requires a lossless data link layer
- Ethernet is not lossless
- Solution → RDMA over Converged Ethernet (RoCE)
RoCE details
- Priority-based Flow Control (PFC), sketched below
  – When busy, send PAUSE
  – When not busy, send RESUME
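PFC runs in switch and NIC hardware; the following is only a conceptual sketch of the per-priority watermark behavior described above, with made-up thresholds.

```c
/* Conceptual PFC watermark logic (not a real switch API). */
#include <stdbool.h>
#include <stdint.h>

#define XOFF_BYTES (96 * 1024)  /* send PAUSE above this (illustrative) */
#define XON_BYTES  (64 * 1024)  /* send RESUME below this (illustrative) */

struct pfc_queue { uint32_t bytes; bool paused; };

void pfc_update(struct pfc_queue *q,
                void (*send_pause)(void), void (*send_resume)(void))
{
    if (!q->paused && q->bytes > XOFF_BYTES) {
        send_pause();            /* upstream stops this priority only */
        q->paused = true;
    } else if (q->paused && q->bytes < XON_BYTES) {
        send_resume();           /* upstream may resume transmitting */
        q->paused = false;
    }
}
```

Because the pause applies to everything arriving on the port (per priority class), not to individual flows, it creates the fairness and head-of-line problems on the next slides.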
Problems with PFC
- Per-port, not per-flow
- Unfairness: port-fair, not flow-fair
- Collateral damage: head-of-line blocking for some flows
Experimental topology
Unfairness
- H1–H4 write to R
- H4 has no contention at port P2
- H1, H2, and H3 have contention on P3 and P4
Head-of-line blocking
- VS → VR
- H11–H14, H31–H32 → R
- T4 is congested and sends PAUSE messages
- T1 pauses all its incoming links regardless of their destinations
Solution
- Per-flow congestion control
- Existing work: QCN (Quantized Congestion Notification)
  – Uses the Ethernet SRC/DST addresses and a flow ID to define a flow
  – The switch sends a congestion notification to the sender based on the source MAC address
  – Only works at L2
- This work: DCQCN
  – Works for IP-routed networks
Why does QCN not work for IP networks?
- The same packet carries different SRC/DST MAC addresses on each IP hop, so a switch cannot map a marked packet back to its end-to-end sender.
DCQCN
- DCQCN is a rate-based, end-to-end congestion control protocol
- Most of the DCQCN functionality is implemented in the NICs
High-level ideas
- Switches ECN-mark packets at an egress queue
- The receiver sends a Congestion Notification Packet (CNP) to the sender
- The sender reduces its sending rate (see the sketch below)
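The sender-side (reaction-point) rules can be sketched as follows. The update equations follow the DCQCN paper; the constants are illustrative, the paper runs the α-decay and rate-increase logic on separate timers (merged here for brevity), and the real implementation lives in the NIC, not host software.

```c
/* DCQCN reaction-point sketch: rc = current rate, rt = target rate,
 * alpha = running estimate of congestion extent. Constants illustrative. */
struct dcqcn_flow { double rc, rt, alpha; };

#define G    (1.0 / 256.0)  /* gain for the alpha moving average */
#define R_AI 40.0           /* additive-increase step (Mbps) */

/* CNP received: the receiver saw ECN marks, so cut multiplicatively. */
void on_cnp(struct dcqcn_flow *f)
{
    f->rt = f->rc;                          /* remember rate before the cut */
    f->rc *= 1.0 - f->alpha / 2.0;          /* cut in proportion to alpha */
    f->alpha = (1.0 - G) * f->alpha + G;    /* congestion persists: raise alpha */
}

/* Timer expired with no CNP: decay alpha and recover toward rt. */
void on_quiet_timer(struct dcqcn_flow *f)
{
    f->alpha = (1.0 - G) * f->alpha;        /* congestion abating */
    f->rc = (f->rt + f->rc) / 2.0;          /* fast recovery */
}

/* After enough quiet rounds, probe beyond the old rate. */
void additive_increase(struct dcqcn_flow *f)
{
    f->rt += R_AI;
    f->rc = (f->rt + f->rc) / 2.0;
}
```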
Challenges
- How to set buffer sizes at the egress queue
- How often to send congestion notifications
- How a sender should reduce its sending rate to ensure both convergence and fairness
Solutions provided by the paper
- ECN must be set before PFC is triggered
  – Use the PFC queue thresholds to set the ECN marking thresholds
- Use a fluid model to tune the congestion control parameters (see the marking sketch below)
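A sketch of the RED-like ECN marking assumed at the switch egress queue: nothing is marked below Kmin, packets are marked with probability rising linearly to Pmax at Kmax, and everything is marked above Kmax. The values below are illustrative; the paper's point is to pick the thresholds (and, via the fluid model, the rate-control parameters) so that marking reliably fires before the PFC pause threshold is reached.

```c
/* RED-like ECN marking decision for one egress queue (illustrative values). */
#include <stdlib.h>

#define KMIN 5.0      /* KB: below this, never mark */
#define KMAX 200.0    /* KB: above this, always mark */
#define PMAX 0.01     /* marking probability at KMAX */

int should_mark_ecn(double qlen_kb)
{
    if (qlen_kb <= KMIN)
        return 0;
    if (qlen_kb >= KMAX)
        return 1;
    double p = PMAX * (qlen_kb - KMIN) / (KMAX - KMIN);
    return ((double)rand() / RAND_MAX) < p;   /* mark with probability p */
}
```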
RDMA over Commodity Ethernet at Scale
Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, Marina Lipshteyn (Microsoft)
What this paper is about
- Extending PFC to IP-routed networks
- Safety issues of RDMA
  – Livelock
  – Deadlock
  – Pause frame storm
  – Slow-receiver symptom
- Performance observed in production networks
[Figure: RDMA transport livelock] A 4 MB message is sent as 1,000 packets; the testbed drops any packet whose IP ID's last byte is 0xff (a 1/256 drop rate).
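A back-of-envelope calculation shows why this is fatal with the go-back-0 retransmission the paper found in the NICs (any loss restarts the whole message):

\[
\Pr[\text{all 1000 packets survive}] = \left(1 - \tfrac{1}{256}\right)^{1000} \approx e^{-1000/256} \approx 0.02,
\]

so only about one attempt in fifty delivers the complete message, and every failed attempt wastes the bytes already sent; goodput collapses to near zero. Switching the NICs to go-back-N retransmission removes the livelock.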
[Figure: PFC pause frame storm] S3 is dead; T1.p2 becomes congested, so PAUSE frames are sent to T1.p3, La.p1, T0.p2, and S1.
[Figure: PFC deadlock] S4 → S2, but S2 is dead, so the blue packets are flooded to T0.p2. T0.p2 is paused; ingress T0.p3 pauses Lb.p0; Lb.p1 pauses T1.p4; T1.p1 pauses S4, closing a cycle of pauses.
Summary
- What is RDMA
- DCQCN: congestion control for RDMA
- Deployment issues for RDMA