SLIDE 1

AVOIDING COORDINATION WITH NETWORK ORDERING: NOPAXOS AND ERIS

Ellis Michael

SLIDE 2

SERVER FAILURES ARE THE COMMON CASE IN DATA CENTERS

SLIDE 6

STATE MACHINE REPLICATION

[Figure: three replicas each apply Operation A, Operation B, and Operation C in the same order.]

SLIDE 9

PAXOS FOR STATE MACHINE REPLICATION

[Figure: the client sends a request to the leader; the leader runs prepare / prepare-ok rounds with the replicas before replying. The leader is marked as a throughput bottleneck, and the extra message round as a latency penalty.]

SLIDE 12

NETWORK PROPERTIES DETERMINE REPLICATION COMPLEXITY

Asynchronous Network (messages may be dropped, reordered, or delivered with arbitrary latency):

  • Paxos protocol on every operation
  • High performance cost

Reliable, Ordered Network (all replicas receive the same set of messages, in the same order):

  • Replication is trivial
  • But the network implementation has the same complexity as Paxos

SLIDE 16

[Figure: a spectrum of network guarantees from weak (the asynchronous network assumed by Paxos) to strong (ordering plus reliability).]

Can we build a network model that:

  • provides performance benefits
  • can be implemented more efficiently?

SLIDE 19

SPECPAXOS ASSUMED THE NETWORK WAS MOSTLY ORDERED. WHAT IF IT COULD PROVIDE AN ORDERING GUARANTEE?

SLIDE 20

TOWARDS AN ORDERED BUT UNRELIABLE NETWORK

Key Idea: Separate ordering from reliable delivery in state machine replication. The network provides ordering; the replication protocol handles reliability.

SLIDE 21

OUM APPROACH

  • Designate one sequencer in the network
  • The sequencer maintains a counter for each OUM group
  • 1. Senders forward OUM messages to the sequencer
  • 2. The sequencer increments its counter and writes the counter value into the packet header
  • 3. Receivers use the sequence numbers to detect reordering and message drops (a minimal sketch follows below)
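The sketch below walks through steps 1 to 3 for a single OUM group. It is a toy model, not the NOPaxos implementation: the class and method names are invented, and the real sequencer stamps packet headers inside the network rather than running as application code.

    # Toy model of OUM: the sequencer stamps every message with a counter
    # value, and each receiver checks the stamps for gaps.
    class Sequencer:
        def __init__(self):
            self.counter = 0

        def stamp(self, msg):
            self.counter += 1
            return self.counter, msg   # sequence number written into the header

    class Receiver:
        def __init__(self):
            self.next_seq = 1          # the sequence number expected next

        def on_packet(self, seq, msg):
            if seq < self.next_seq:
                return "DUPLICATE", msg            # already seen; ignore
            if seq > self.next_seq:
                missing = list(range(self.next_seq, seq))
                self.next_seq = seq + 1
                return "GAP", missing              # replication layer must recover these
            self.next_seq += 1
            return "DELIVER", msg

    s, r = Sequencer(), Receiver()
    p1, p2, p3 = s.stamp("op-a"), s.stamp("op-b"), s.stamp("op-c")
    print(r.on_packet(*p1))   # ('DELIVER', 'op-a')
    print(r.on_packet(*p3))   # ('GAP', [2]): packet 2 was dropped or delayed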

SLIDE 22

ORDERED UNRELIABLE MULTICAST

[Figure: senders forward messages through the sequencer, whose counter stamps each message 1, 2, 3, 4 before delivery to the receivers; one stamped message is dropped in the network.]

Ordered Multicast: no coordination required to determine the order of messages.

Drop Detection: coordination is only required when messages are dropped.

SLIDE 28

SEQUENCER IMPLEMENTATIONS

In-switch sequencing:
  • next-generation programmable switches
  • implemented in P4
  • nearly zero cost

Middlebox prototype:
  • Cavium Octeon network processor
  • connects to root switches
  • adds 8 us latency

End-host sequencing:
  • no specialized hardware required
  • incurs higher latency penalties
  • similar throughput benefits

SLIDE 31

NOPAXOS OVERVIEW

  • Built on top of the guarantees of OUM
  • Client requests are totally ordered but can be dropped
  • No coordination in the common case
  • Replicas run agreement only on drop detection
  • A view change protocol handles leader or sequencer failure

SLIDE 32

NORMAL OPERATION

[Figure: the client's request reaches the leader and the replicas through OUM; the leader executes it, and the replicas reply directly to the client.]

The client waits for replies from a majority, including the leader's: no coordination, 1 round trip time. A sketch of the client's completion check follows below.
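As a hedged illustration of that wait condition, the check below treats a request as complete once the leader's reply plus f matching acknowledgments have arrived (f+1 of the 2f+1 replicas). The field names are assumptions for this sketch, not the protocol's wire format.

    # Sketch of the client-side completion check (assumed field names).
    # Followers acknowledge with their view and log slot; only the
    # leader's reply carries the execution result.
    def request_complete(replies, leader_id, f):
        leader_reply = next((r for r in replies if r["replica"] == leader_id), None)
        if leader_reply is None:
            return None                    # keep waiting: the leader's reply is required
        matching = [r for r in replies
                    if r["view"] == leader_reply["view"]
                    and r["slot"] == leader_reply["slot"]]
        if len(matching) >= f + 1:         # majority of the 2f + 1 replicas
            return leader_reply["result"]
        return None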

SLIDE 38

GAP AGREEMENT

Replicas detect message drops.

  • Non-leader replicas recover the missing message from the leader
  • The leader replica coordinates to commit a NO-OP (via Paxos)
  • Efficient recovery from network anomalies (sketched below)
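A self-contained toy sketch of those two recovery paths, with the agreement round hidden behind callbacks; the structure here is an assumption for illustration, not the protocol's actual message flow.

    # Toy gap handling: followers ask the leader; the leader commits a NO-OP.
    def fill_gap(log, slot, is_leader, fetch_from_leader, commit_noop):
        if not is_leader:
            msg = fetch_from_leader(slot)        # try to recover the request
            log[slot] = msg if msg is not None else "NO-OP"
        else:
            commit_noop(slot)                    # Paxos-style round across replicas
            log[slot] = "NO-OP"                  # all replicas agree to skip the slot

    log = {1: "op-a"}
    fill_gap(log, 2, is_leader=False,
             fetch_from_leader=lambda slot: "op-b",   # the leader had a copy
             commit_noop=lambda slot: None)
    print(log)   # {1: 'op-a', 2: 'op-b'}
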
SLIDE 39

WHY DO FOLLOWERS NOT EXECUTE?

  • Request logs in NOPaxos are non-authoritative: a follower might not be in the quorum that commits a no-op, and the leader might be replaced.
  • Followers simply log operations. Operations are permanently committed by periodic synchronization.
  • If a replaced leader discovers that some of its commands weren't actually committed, it can roll back or get a state transfer.

SLIDE 40

VIEW CHANGE

  • Handles leader or sequencer failure
  • Ensures that all replicas are in a consistent state and agree on all of the commands and no-ops committed in the previous view
  • Runs a view change protocol similar to Viewstamped Replication (VR)
  • The view-number is a tuple <leader-number, session-number> (see the toy model below)
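A toy model of that tuple, under the simplifying assumptions that a leader failure bumps the first component, a sequencer failover bumps the second, and a replica adopts the element-wise maximum it has seen; this is for intuition only, not the paper's exact rule.

    # view-number = (leader-number, session-number)
    def bump_leader(view):
        leader, session = view
        return (leader + 1, session)

    def bump_session(view):
        leader, session = view
        return (leader, session + 1)

    def merge(v1, v2):   # adopt the newest of each component
        return (max(v1[0], v2[0]), max(v1[1], v2[1]))

    v = merge(bump_session((0, 0)), bump_leader((0, 0)))
    print(v)   # (1, 1): a new leader and a new sequencer session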

SLIDE 41

NOPAXOS ACHIEVES BETTER THROUGHPUT AND LATENCY

[Figure: latency (us) versus throughput (ops/sec) for NOPaxos, Fast Paxos, Paxos, and Paxos + Batching; lower latency and higher throughput are better.]

  • 4.7X the throughput of Paxos and more than a 40% reduction in latency
  • 25% higher throughput and 6X lower latency than Paxos + Batching

SLIDE 46

NOPAXOS IS RESILIENT TO NETWORK ANOMALIES

[Figure: throughput (ops/sec) versus packet drop rate (0.001% to 1%) for NOPaxos, Speculative Paxos, and Paxos; the Speculative Paxos line is annotated "drops to 24% of maximum throughput".]

SLIDE 49

NOPAXOS ATTAINS THROUGHPUT WITHIN 2% OF AN UNREPLICATED SYSTEM

[Figure: latency (us) versus throughput (ops/sec) for NOPaxos, NOPaxos using an end-host sequencer, Paxos, and an unreplicated system; lower latency and higher throughput are better.]

  • NOPaxos comes within 2% of the throughput and 16 us of the latency of an unreplicated system
  • With an end-host sequencer, NOPaxos has similar throughput but 36% higher latency

SLIDE 55

SUMMARY

  • Separate ordering from reliable delivery in state machine replication
  • A network model, OUM, that provides ordered but unreliable message delivery
  • A more efficient replication protocol, NOPaxos, that ensures reliable delivery
  • The combined system achieves performance equivalent to an unreplicated system

SLIDE 56

THE ERIS TRANSACTION PROTOCOL

SLIDE 57

EXISTING TRANSACTIONAL SYSTEMS: EXTENSIVE COORDINATION

[Figure: a client runs a transaction across Shard 1, Shard 2, and Shard 3 with request, prepare, prepare-ok, and commit message rounds.]

SLIDE 61

ERIS

  • Processes independent transactions without coordination in the normal case
  • Performance within 3% of a non-transactional, unreplicated system on TPC-C
  • Strongly consistent, fault-tolerant transactions with minimal performance penalties

SLIDE 62

KEY CONTRIBUTIONS

A new architecture that divides the responsibility for transactional guarantees by leveraging the datacenter network to order messages within and across shards, and a co-designed transaction protocol with minimal coordination.

SLIDE 63

TRADITIONAL LAYERED APPROACH

[Figure: atomic commitment (2PC) sits above per-shard concurrency control (2PL), which sits above per-shard replication (Paxos) across three replicas each.]

In this layering, replication provides ordering and reliability within each shard; concurrency control provides isolation; and atomic commitment provides ordering and reliability across shards.

SLIDE 67

A NEW WAY TO DIVIDE RESPONSIBILITIES

[Figure: Eris assigns ordering, both within and across shards, to multi-sequencing in the network; reliability, within and across shards, to the independent transaction protocol; and isolation to the general transaction protocol in the application.]

SLIDE 69

GOAL

[Figure: clients send their transactions through a sequencer.]

SLIDE 70

IN-NETWORK CONCURRENCY CONTROL GOALS

  • Globally consistent ordering across messages delivered to multiple destination shards
  • No reliable delivery guarantee
  • Recipients can detect dropped messages

SLIDE 71

[Figure: transactions T1 (destined for shards A, B, and C) and T2 (destined for A and B) are multicast to the receivers. Without sequencing, T1 and T2 can arrive at different shards in different orders, and a dropped copy of T1 can go unnoticed.]

SLIDE 78

MULTI-SEQUENCED GROUPCAST

  • Groupcast: the message header specifies a set of destination multicast groups
  • Multi-sequenced groupcast: messages are sequenced atomically across all recipient groups
  • The sequencer keeps a counter for each group (see the sketch below)
  • Extends OUM in NOPaxos
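A minimal sketch of multi-sequenced stamping, matching the counters the following slides animate; the MultiSequencer class and its names are invented for illustration.

    from collections import defaultdict

    # One counter per destination group; all of a message's counters are
    # bumped together, which the real sequencer does as one atomic step.
    class MultiSequencer:
        def __init__(self):
            self.counters = defaultdict(int)

        def stamp(self, dest_groups, msg):
            stamps = {}
            for g in sorted(dest_groups):
                self.counters[g] += 1
                stamps[g] = self.counters[g]
            return stamps, msg

    seq = MultiSequencer()
    print(seq.stamp({"A", "B", "C"}, "T1"))   # ({'A': 1, 'B': 1, 'C': 1}, 'T1')
    print(seq.stamp({"A", "B"}, "T2"))        # ({'A': 2, 'B': 2}, 'T2')
    print(seq.stamp({"A"}, "T3"))             # ({'A': 3}, 'T3')
    # C's counter stays at 1, so shard C sees no gap from T2 or T3.
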
SLIDE 79

[Figure: the sequencer keeps one counter per group, starting at A0 B0 C0. T1 (ABC) is stamped A1 B1 C1; T2 (AB) is stamped A2 B2; T3 (A) is stamped A3. Each shard's counter advances without gaps, so a receiver notices immediately when one of its stamped messages is dropped.]

SLIDE 98

WHAT HAVE WE ACCOMPLISHED SO FAR?

  • A consistently ordered groupcast primitive with drop detection
  • How do we go from multi-sequenced groupcast to transactions?

SLIDE 99

TRANSACTION MODEL

Eris supports two types of transactions:

  • Independent transactions:
    ✤ One-shot (stored procedures)
    ✤ No cross-shard dependencies
    ✤ Proposed by H-Store [VLDB ’07] and Granola [ATC ’12]
  • Fully general transactions

SLIDE 100

INDEPENDENT TRANSACTION

Three shards hold rows of an employee table (Alice 600, Bob 350, Charlie 400), and each executes the same one-shot statement locally:

    START TRANSACTION
    UPDATE tb t1 SET t1.Salary = t1.Salary + 100 WHERE t1.Salary < 500
    COMMIT

This transaction is independent: each shard can execute it using only its own data, so Bob's salary becomes 450 and Charlie's becomes 500 with no cross-shard communication. The following transaction is Not Independent, because each shard's predicate depends on rows stored at the other shards:

    START TRANSACTION
    UPDATE tb t1 SET t1.Salary = t1.Salary + 100
    WHERE 500 < (SELECT AVG(t2.Salary) FROM tb t2)
    COMMIT

Many applications consist entirely of independent transactions.

SLIDE 107

WHY INDEPENDENT TRANSACTIONS?

  • No coordination/communication across shards
  • Executing them serially at each shard in a consistent order guarantees serializability (see the sketch below)
  • Multi-sequenced groupcast establishes such an order
  • How to handle message drops and sequencer/server failures?
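As a small illustration of the second bullet, each shard can simply apply its transactions serially in the network-assigned order; apply_raise is an invented one-shot body echoing the example on slide 100.

    # Serial execution in sequence order at one shard; no coordination needed.
    def apply_raise(state, name):
        if state[name] < 500:                # one-shot body from the earlier example
            state[name] += 100

    def run_shard(state, stamped_txns):
        for _, name in sorted(stamped_txns): # sort by network sequence number
            apply_raise(state, name)
        return state

    print(run_shard({"Bob": 350}, [(1, "Bob")]))   # {'Bob': 450}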

SLIDE 108

NORMAL CASE

[Figure: the client groupcasts a transaction through the sequencer to Shards 1, 2, and 3; within each shard, the designated learner and the replicas receive it, and the shard replies to the client. 1 round trip, no coordination.]

SLIDE 115

HOW TO HANDLE DROPPED MESSAGES?

[Figure: shard A receives T1 (stamped A1) and T3 (stamped A3), but its copy of T2 (stamped A2) is dropped. A detects the gap in its sequence numbers, yet the missing message may survive only at shard B: recovering it is a global coordination problem.]

SLIDE 121

THE FAILURE COORDINATOR

[Figure: shard A detects the gap at A2 and notifies the failure coordinator, which asks the shards "Received A2?". If some replica still has the message, the coordinator re-delivers it to everyone who missed it. If every shard answers "Not Found", the coordinator tells all affected shards to "Drop A2": each treats that slot as a NO-OP, and all shards record the same set of drops.]

A condensed sketch of this logic follows below.
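The sketch is a self-contained toy of that exchange: the method names stand in for the coordinator's messages, and for simplicity messages are keyed by a single id rather than by per-shard stamps. It is illustrative, not the Eris protocol itself.

    # Toy failure coordinator: poll the shards for a missing message;
    # re-deliver a surviving copy, or have every shard drop it as a NO-OP.
    class ShardReplica:
        def __init__(self, received):
            self.received = dict(received)   # message id -> message
            self.dropped = set()

        def lookup(self, msg_id):            # "Received A2?"
            return self.received.get(msg_id)

        def redeliver(self, msg_id, msg):
            self.received[msg_id] = msg

        def record_drop(self, msg_id):       # "Drop A2": slot becomes a NO-OP
            self.dropped.add(msg_id)

    def handle_gap(shards, msg_id):
        for shard in shards:
            msg = shard.lookup(msg_id)
            if msg is not None:              # a copy survives somewhere
                for dest in shards:
                    dest.redeliver(msg_id, msg)
                return "RECOVERED"
        for dest in shards:                  # every shard answered "Not Found"
            dest.record_drop(msg_id)
        return "DROPPED"

    a = ShardReplica({1: "T1", 3: "T3"})     # shard A missed message 2
    b = ShardReplica({1: "T1", 2: "T2"})     # shard B still has it
    print(handle_gap([a, b], 2))             # RECOVERED: B's copy is re-delivered
    print(handle_gap([ShardReplica({}), ShardReplica({})], 2))   # DROPPED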

SLIDE 135

DESIGNATED LEARNER AND SEQUENCER FAILURES

Designated learner (DL) failure:

  • View-change-based protocol
  • Ensures the new DL learns all committed transactions from previous views

Sequencer failure:

  • The new sequencer stamps messages with a higher epoch number
  • An epoch change ensures all replicas across all shards start the new epoch in consistent states: they all agree on the exact set of transactions completed in the previous epoch

SLIDE 136

CAN WE PROCESS NON-INDEPENDENT TRANSACTIONS EFFICIENTLY?

SLIDE 137

APPROACH: DIVIDE INTO INDEPENDENT TRANSACTIONS

  • Relies on the linearizable execution of independent transactions: this gives the abstraction of a single, correct machine that processes independent transactions only
  • Uses locks to provide strong isolation
  • Two phases (sketched below):
    ✤ Independent transaction 1: execute reads and acquire locks
    ✤ Independent transaction 2: commit/abort changes and release locks
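A toy sketch of the two phases under stated assumptions: the Shard class and its one-shot handlers are invented names, and each call below stands for an independent transaction delivered by multi-sequenced groupcast.

    # Phase 1 reads and locks in one shot per shard; phase 2 applies the
    # computed writes and releases the locks, again in one shot per shard.
    class Shard:
        def __init__(self, data):
            self.data = dict(data)
            self.locks = set()

        def read_and_lock(self, key):              # independent transaction 1
            self.locks.add(key)
            return self.data[key]

        def commit_and_unlock(self, key, value):   # independent transaction 2
            self.data[key] = value
            self.locks.discard(key)

    x_shard, y_shard = Shard({"x": 10}), Shard({"y": 5})
    x = x_shard.read_and_lock("x")                 # phase 1, groupcast to both shards
    y = y_shard.read_and_lock("y")
    x_shard.commit_and_unlock("x", x + y)          # phase 2: move y's balance into x
    y_shard.commit_and_unlock("y", 0)
    print(x_shard.data, y_shard.data)              # {'x': 15} {'y': 0}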

SLIDE 138

BENEFITS OF OUR LAYERED ARCHITECTURE

  • Simple handling of client failures: if the client fails, any server can unilaterally send the abort command for its general transactions as an independent transaction
  • No deadlocks or deadlock detection: locks are acquired in a single step
  • Furthermore, aborts aren't even needed; wait queues are easy
  • Takes advantage of the efficient independent transaction processing layer: general transactions are processed in two round trips in the normal case

SLIDE 139

EVALUATION COMPARISON SYSTEMS

  • Lock-Store (2PC + 2PL + Paxos)
  • TAPIR [SOSP ’15]
  • Granola [ATC ’12]
  • Non-transactional, unreplicated (NT-UR)

SLIDE 140

ERIS PERFORMS WELL ON INDEPENDENT TRANSACTIONS

[Figure: throughput (txns/sec) on distributed independent transactions for Lock-Store, TAPIR, Granola, Eris, and NT-UR.]

  • Eris outperforms Lock-Store, TAPIR, and Granola by more than 3X
  • Eris achieves throughput within 10% of NT-UR
  • More than 70% reduction in latency compared to Lock-Store, and within 10% of NT-UR's latency

SLIDE 144

ERIS ALSO PERFORMS WELL ON GENERAL TRANSACTIONS

[Figure: throughput (txns/sec) on distributed general transactions for Lock-Store, TAPIR, Granola, Eris, and NT-UR.]

Eris maintains throughput within 10% of NT-UR.

SLIDE 146

ERIS EXCELS AT COMPLEX TRANSACTIONAL APPLICATIONS

[Figure: throughput (txns/sec) on the TPC-C benchmark for Lock-Store, TAPIR, Granola, Eris, and NT-UR.]

  • 7.6X and 6.4X higher throughput than Lock-Store and TAPIR
  • Within 3% of NT-UR's throughput

SLIDE 149

ERIS IS RESILIENT TO NETWORK ANOMALIES

[Figure: throughput (txns/sec) versus packet drop rate (0.01% to 10%) for Eris, Lock-Store, TAPIR, Granola, and NT-UR.]

SLIDE 151

ERIS RECAP

  • A new division of responsibility for transaction processing:
    ✤ An in-network concurrency control mechanism that establishes a consistent order of transactions across shards
    ✤ An efficient protocol that ensures reliable delivery of independent transactions
    ✤ A general transaction layer atop independent transaction processing
  • Result: strongly consistent, fault-tolerant transactions with minimal performance overhead

SLIDE 152

ERIS AND NOPAXOS DISCUSSION

  • Can we use an end-host sequencer for Eris? In NOPaxos, it's not a problem.
  • What properties are important to NOPaxos's "scalability"?
  • How deployable are these approaches?
  • How scalable is Eris compared to two-phase commit?