SLIDE 1

Towards Low-Latency Byzantine Agreement Protocols Using RDMA

DSN Workshop on Byzantine Consensus and Resilient Blockchains Signe Rüsch, Ines Messadi, Rüdiger Kapitza, 2018-06-25

ruesch@ibr.cs.tu-bs.de Technische Universität Braunschweig, Germany

SLIDE 3

Motivation RDMA Design Evaluation Conclusion

Blockchain and Cryptocurrencies

Permissionless: Proof-of-Work for ordering agreement
  Scalability and energy consumption issues
Permissioned: e.g. for companies’ supply chain management (SCM)
  Blocks can be created by dedicated nodes in data centers
  Crash-fault tolerant protocols: Hyperledger Fabric with Kafka
  → Additional security of Byzantine fault tolerant (BFT) protocols!

[Figure: chain of blocks n, n+1, n+2; each block holds transactions tx1, tx2, … and the hash of its predecessor: h(n-1), h(n), h(n+1)]

2018-06-25 Signe Rüsch Byzantine Agreement Protocols Using RDMA Page 2 Institute of Operating Systems and Computer Networks

SLIDE 4

BFT Protocols

3f + 1 nodes reach consensus on the order of requests
High throughput requirements: blockchain to replace a company’s database
Multiple rounds of message exchanges
Broadcast steps
→ High message complexity and latency!
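The 3f + 1 bound can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the slides: the class and method names are hypothetical, the quorum size 2f + 1 is the one used by PBFT-style protocols when n = 3f + 1, and the message count simply assumes that every node sends to every other node in one broadcast step.

```java
/** Sizing arithmetic for a BFT replica group (illustrative helper, not from the slides). */
final class BftSizing {

    /** Largest number of Byzantine faults f such that n >= 3f + 1 still holds. */
    static int maxFaults(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        return (n - 1) / 3;
    }

    /** Quorum size 2f + 1, as used by PBFT-style protocols with n = 3f + 1 nodes. */
    static int quorum(int f) {
        return 2 * f + 1;
    }

    /** Messages in one all-to-all broadcast step: each of n nodes sends to the n - 1 others. */
    static long allToAllMessages(int n) {
        return (long) n * (n - 1);
    }

    public static void main(String[] args) {
        for (int n : new int[] {4, 7, 10}) {
            int f = maxFaults(n);
            System.out.printf("n=%d f=%d quorum=%d all-to-all=%d%n",
                    n, f, quorum(f), allToAllMessages(n));
        }
    }
}
```

The quadratic growth of `allToAllMessages` is one way to see why per-message network latency matters for these protocols.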

SLIDE 6

BFT Protocols

Message complexity optimization focuses on the protocol level
  E.g. hybrid BFT protocols
Current BFT protocols achieve the necessary throughput
  ≈1 million operations/second (Behl et al., EuroSys’17)

Our focus: reduce latency on the network layer with technology available in data centers!

SLIDE 8

TCP Overhead

Two intermediate data copy steps per host
  Application → kernel → network
  Network → kernel → application
>50 % of TCP latency is due to data copying (Frey et al., ICDCS’09)

[Figure: two hosts, each with application, OS, and NIC buffers; data paths labeled TCP/IP and RDMA over Converged Ethernet]

→ Reduce latency of BFT protocols with a Remote Direct Memory Access (RDMA) communication framework!

SLIDE 9

Overview

Remote Direct Memory Access
Design of Rubin
Evaluation of Rubin
Conclusion

SLIDE 11

Remote Direct Memory Access

Zero-copy communication protocol
Kernel bypassing
Data transfer directly into remote memory
Applications register memory with the RDMA NIC
Message-oriented and asynchronous operations
Often employed in data centers

→ Low latency, high throughput, CPU efficient!
But: possible security issues due to direct memory access?

SLIDE 13

RDMA Consensus Protocols

DARE (Poke et al., HPDC’15)
  RDMA-tailored SMR protocol
  Achieves low latency in replica communication
APUS (Wang et al., SoCC’17)
  Combines RDMA with Paxos
  Scalability regarding concurrent connections
Derecho (Jha et al., 2017)
  C++ library for replicated crash-fault tolerant services built on Paxos

→ Only crash faults are considered, no previous work on BFT!
How to implement RDMA communication for BFT frameworks?

SLIDE 14

Requirements

1. Easy integration into existing BFT prototypes
2. Security guarantees even in the presence of malicious nodes
3. Zero-copy communication

SLIDE 16

1. Easy Integration

RDMA communication for multiple BFT frameworks
  BFT-SMaRt (Bessani et al., DSN’14)
  UpRight (Clement et al., SOSP’09)
  Reptor (Behl et al., Middleware’15)
BFT frameworks are very complex, e.g. Reptor:
  Core: 50,000 LOC (Java)
  Deployment, benchmarking: 14,000 LOC (Python)
High development effort
  ≈20 years of BFT research
  Limited number of BFT frameworks

→ Direct integration is far too much overhead!

SLIDE 18

1. Easy Integration

BFT frameworks are often written in Java
They use Java NIO for high-performance communication
  With clients (BFT-SMaRt), replicas (UpRight), or both (Reptor)
Frameworks are optimized to reduce data copy steps
Need a suitable level of abstraction
  Not as low-level as the native RDMA interface
  Not as high-level as JSOR: socket interface, but intermediate data copies by default (Thirugnanapandi, 2014)

→ Modeled after Java NIO
→ Interface similar to the Java socket interface
→ Easy switch between RDMA and TCP communication

SLIDE 20

2. Security: RDMA Semantics

Read/Write
  Used in APUS and DARE
  Fastest communication mode
  Exchange of a memory key specifying the buffer location
  Receiver is not notified
  Security risks in a BFT setting: a malicious node can obtain the memory key and corrupt memory
Send/Receive
  Both sides are active
  Receiver is notified
  No known memory key
  Remote memory locations are decided by the application
  → No memory corruption!

[Figure, Read/Write: Server A and Server B, each with an RNIC and a data buffer; after a memory key exchange, RDMA Write(data, key) targets the remote data buffer]
[Figure, Send/Receive: Server A and Server B, each with an RNIC; RDMA Send(data) from a send buffer on one side, RDMA Receive into a receive buffer on the other]
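The difference between the two semantics can be illustrated with a toy in-memory model. This sketch makes no real RDMA calls and is not Rubin's or DiSNI's API; all class and method names are hypothetical. It only shows who decides where data lands: with a one-sided write, anyone presenting the key picks the offset and the owner is not notified, whereas with send/receive the receiver posts the buffers itself and sees a completion.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy in-memory model of the two RDMA semantics; no real RDMA calls are made. */
final class RdmaSemanticsToy {

    /** One-sided write: whoever holds the key chooses the offset in the remote buffer. */
    static final class WritableRegion {
        final int key;
        final byte[] memory;

        WritableRegion(int key, int size) {
            this.key = key;
            this.memory = new byte[size];
        }

        /** Returns false on a wrong key or out-of-range write; the owner is never notified. */
        boolean remoteWrite(int presentedKey, int offset, byte[] data) {
            if (presentedKey != key) {
                return false;
            }
            if (offset < 0 || offset + data.length > memory.length) {
                return false;
            }
            System.arraycopy(data, 0, memory, offset, data.length);
            return true;
        }
    }

    /** Two-sided send/receive: the receiver posts buffers and thereby decides where data lands. */
    static final class ReceiveQueue {
        private final Deque<ByteBuffer> posted = new ArrayDeque<>();
        private final Deque<ByteBuffer> completed = new ArrayDeque<>();

        void postReceive(ByteBuffer buffer) {
            posted.addLast(buffer);
        }

        /** Sender side: there is no address or key argument; fails if no fitting buffer is posted. */
        boolean send(byte[] data) {
            ByteBuffer target = posted.peekFirst();
            if (target == null || target.remaining() < data.length) {
                return false;
            }
            posted.removeFirst();
            target.put(data);
            target.flip();
            completed.addLast(target); // models the completion event that notifies the receiver
            return true;
        }

        /** Receiver side: returns the next filled buffer, or null if nothing has arrived. */
        ByteBuffer pollCompletion() {
            return completed.pollFirst();
        }
    }
}
```

In the toy, a holder of the key can overwrite any byte of `WritableRegion.memory` at any time, while `ReceiveQueue.send` offers no way to name a remote location at all.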

SLIDE 21

2. Security: RDMA Semantics

→ Send/Receive offers higher security: neither memory corruption nor MitM attacks are possible!

SLIDE 22

Our Framework: Rubin

Modeled after Java NIO and the socket interface
Integration into several BFT frameworks possible
Use RDMA Send/Receive semantics for security
Integrate into the Reptor framework
Use the DiSNI library for RDMA communication in Java

[Figure: software stack with the layers BFT, Java NIO, RUBIN, DiSNI, RDMA]

SLIDE 25

Rubin Components

RDMA Channel: Java NIO SocketChannel with RDMA resources
RDMA Selector: efficiently handles multiple channels with one thread
  Selects channels that are ready for certain events
  Avoids expensive context switching
RDMA Selection Keys: channel operation
  Send message, receive message, connection establishment

SLIDE 30

Workflow of Rubin

1. Channel registration, set interest
2. Selection Key creation
3. Non-/Blocking select()
4. Hybrid event queue, notify selector
5. Select responsible channel
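Since Rubin is modeled after Java NIO, the workflow can be related to the standard NIO selector pattern. The slides do not give Rubin's own class or method names, so the sketch below uses only the standard `java.nio` API over TCP; the class `NioEchoSketch` and its methods are hypothetical. The hybrid event queue of step 4 has no visible counterpart here, because readiness detection happens inside the JDK selector.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/** Echo server on plain Java NIO (TCP), showing the selector workflow Rubin is modeled after. */
final class NioEchoSketch {

    /** Echoes incoming data until {@code maxReads} non-empty reads have been echoed back. */
    static void serve(ServerSocketChannel server, int maxReads) throws IOException {
        try (Selector selector = Selector.open()) {
            server.configureBlocking(false);
            // Steps 1 and 2: register the channel with an interest set; this creates a SelectionKey.
            server.register(selector, SelectionKey.OP_ACCEPT);
            int echoed = 0;
            while (echoed < maxReads) {
                selector.select(); // Step 3: blocking select()
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) { // Step 5: each ready key identifies the responsible channel
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        if (client != null) {
                            client.configureBlocking(false);
                            client.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(1024));
                        }
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        ByteBuffer buf = (ByteBuffer) key.attachment();
                        int n = client.read(buf);
                        if (n < 0) { // peer closed the connection
                            client.close();
                            continue;
                        }
                        buf.flip();
                        while (buf.hasRemaining()) {
                            client.write(buf); // echo the bytes back
                        }
                        buf.clear();
                        if (n > 0) {
                            echoed++;
                        }
                    }
                }
            }
        }
    }

    /** Blocking client helper: sends one message and returns the echoed reply. */
    static String echoOnce(int port, String message) throws IOException {
        try (SocketChannel ch = SocketChannel.open(new InetSocketAddress("127.0.0.1", port))) {
            byte[] out = message.getBytes(StandardCharsets.UTF_8);
            ch.write(ByteBuffer.wrap(out));
            ByteBuffer in = ByteBuffer.allocate(out.length);
            while (in.hasRemaining() && ch.read(in) >= 0) {
                // keep reading until the full reply has arrived or the stream ends
            }
            return new String(in.array(), 0, in.position(), StandardCharsets.UTF_8);
        }
    }
}
```

A single thread serves all registered channels, which is the property the RDMA Selector is meant to preserve.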

SLIDE 31

3. Zero-Copy Communication

Pool of pre-allocated, RDMA-registered application buffers
Optimization: selective signaling to reduce notification overhead

→ Challenge: buffer copy
  Sender: registers application buffers, no buffer copy
  Receiver: copies data to the application buffer due to incompatibility
  → DiSNI requires direct buffers, but Reptor also uses heap buffers
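Selective signaling means that only some send requests ask for a completion event, so the sender has fewer notifications to process. The toy counter below illustrates the idea only: it is not Rubin's implementation, the class name and the interval parameter k are hypothetical, and the slides do not state which interval Rubin uses.

```java
/** Toy counter for selective signaling: only every k-th send requests a completion event. */
final class SelectiveSignalingToy {
    private final int interval; // k, a hypothetical tuning parameter
    private long posted;

    SelectiveSignalingToy(int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.interval = interval;
    }

    /** Call once per posted send; returns true if this send should be flagged as signaled. */
    boolean nextSendIsSignaled() {
        posted++;
        return posted % interval == 0;
    }

    /** Number of completion events generated for {@code sends} posted sends. */
    static long completionsFor(long sends, int interval) {
        return sends / interval;
    }
}
```

With an interval of 1 every send is signaled; larger intervals trade fewer notifications against later feedback about completed sends.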

SLIDE 33

Evaluation Setup

2 server machines: 4-core Xeon v2 CPUs and 16 GB RAM
10 Gbps switched network
Mellanox ConnectX-3 RDMA NICs

Q1: How does RDMA communication compare to TCP?
Q2: What is the performance of Rubin?

Echo server application:
  Q1: Distributed microbenchmark for RDMA Channel
  Q2: Local microbenchmark for Rubin in the Reptor communication stack

SLIDE 35

RDMA Microbenchmarks – Latency

[Figure: latency (µs) over payload (KB) for TCP, RDMA Send/Recv, RDMA Read/Write, and RDMA Channel]

RDMA Channel: 33 – 43 % lower latency than TCP
Optimizations: 30 % lower latency than Send/Recv for messages <16 KB
Performance degradation due to the remaining buffer copy

SLIDE 36

RDMA Microbenchmarks – Throughput

[Figure: requests per second (krps) over payload (KB) for TCP, RDMA Send/Recv, RDMA Read/Write, and RDMA Channel]

RDMA Channel: 33 – 43 % higher throughput than TCP
Optimizations: 30 % higher throughput than Send/Recv

SLIDE 37

Rubin Microbenchmarks – Latency

[Figure: latency (µs) over payload (KB) for Rubin and TCP]

1 KB, 100 KB: Rubin has 19 – 20 % lower latency than TCP
20 KB – 80 KB: Rubin has 20 % higher latency than TCP

SLIDE 38

Rubin Microbenchmarks – Throughput

[Figure: requests per second over payload (KB) for Rubin and TCP]

Rubin has 25 – 38 % higher throughput than TCP
Limited by the buffer copy → remove and optimize!

SLIDE 39

Future Work

Zero-copy: remove any additional data copy steps
Reptor: evaluate a fully replicated system with Rubin communication
Integration of Reptor into a permissioned blockchain framework
  E.g. Hyperledger Fabric

SLIDE 41

Conclusion – Rubin

RDMA framework for BFT protocols
High-level abstraction to maintain flexibility
Easy integration: modeled after the Java NIO interface
25 – 38 % higher throughput than TCP
Next: RDMA-capable BFT ordering service in a permissioned blockchain setting

Questions?

ruesch@ibr.cs.tu-bs.de

SLIDE 42

Backup – RDMA Communication

OS only used to establish the connection
Queue Pair: send queue (SQ) and receive queue (RQ) holding work requests
Work Request (WR): information about data to be sent/received
Completion Queue (CQ): holds events (CQEs) notifying the application about finished operations

[Figure: sender and receiver, each with registered memory, SQ, RQ, and CQ; a Send WR refers to the buffer to send, a Recv WR refers to the receive buffer, and a CQE appears in each CQ]

SLIDE 43

Backup – Reptor Buffer Management

DiSNI requires direct buffers in native memory
Reptor uses both direct buffers and heap buffers in JVM memory
Remote side needs pre-prepared buffers to receive data via RDMA
Reptor has a complex buffer management scheme, often replacing buffers
→ Redesign parts of the buffer management
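The direct/heap mismatch can be shown with the standard `java.nio.ByteBuffer` API. This is a minimal sketch under stated assumptions: the pool class and its methods are hypothetical, no RDMA registration takes place, and the code is neither Reptor's nor Rubin's buffer management. It shows a pool of pre-allocated direct buffers and the extra receive-side copy that becomes necessary when the application expects a heap buffer.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Pool of pre-allocated direct buffers plus the receive-side copy into a heap buffer. */
final class DirectBufferPoolSketch {
    private final Deque<ByteBuffer> free = new ArrayDeque<>();

    DirectBufferPoolSketch(int buffers, int capacity) {
        for (int i = 0; i < buffers; i++) {
            // Direct buffers live in native memory, outside the garbage-collected JVM heap.
            free.addLast(ByteBuffer.allocateDirect(capacity));
        }
    }

    /** Returns a cleared direct buffer, or null if the pool is exhausted. */
    ByteBuffer acquire() {
        ByteBuffer b = free.pollFirst();
        if (b != null) {
            b.clear();
        }
        return b;
    }

    /** Hands a buffer back to the pool for reuse. */
    void release(ByteBuffer b) {
        if (!b.isDirect()) {
            throw new IllegalArgumentException("pool only holds direct buffers");
        }
        free.addLast(b);
    }

    /** The extra receive-side step: copy the readable bytes into a new heap buffer. */
    static ByteBuffer copyToHeap(ByteBuffer received) {
        ByteBuffer heap = ByteBuffer.allocate(received.remaining());
        heap.put(received.duplicate()); // duplicate() leaves the source position untouched
        heap.flip();
        return heap;
    }
}
```

Avoiding `copyToHeap` altogether would require the application to consume the direct buffer itself, which is the kind of redesign the slide points to.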

SLIDE 44

Backup – Reptor

BFT framework implementing both PBFT and Hybster
Hybster: hybrid BFT protocol with a trusted subsystem (TSS) using Intel SGX
Consensus-oriented parallelization: parallel execution of consensus instances

SLIDE 45

Backup – Security Analysis

RDMA mechanisms: Protection Domains and memory access permissions
Security issues are mostly relevant for Read/Write communication
Read/Write: a node reads data while it is being overwritten → data corruption
Steering Tag:
  Buffer identifier
  MitM attacks
  Invalidate tag to prevent legitimate access

SLIDE 46

Backup – TCP Overhead

(Frey et al., ICDCS’09)
