Consistency-Aware Durability: Aishwarya Ganesan, Ram Alagappan, et al.



slide-1
SLIDE 1

Strong and Efficient Consistency with Consistency-Aware Durability

Aishwarya Ganesan, Ram Alagappan, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau

slide-2
SLIDE 2

Distributed Storage Systems


slide-8
SLIDE 8

Consistency Models in Distributed Systems

What does a read see given a previous set of reads and writes?

Models range from strong (linearizability) to weak (eventual), with sequential, causal+, causal, PRAM, and monotonic reads in between.

Well studied and understood!

slide-13
SLIDE 13

Durability Models

Unlike consistency models, durability models have received scant attention!

A durability model defines how writes are replicated and persisted
The durability model influences consistency
It also determines performance

Despite this importance, it is often overlooked!

slide-24
SLIDE 24

Two Widely Used Durability Models

Immediate durability: the client's write is replicated to node-1, node-2, and node-3 and persisted (fsync) on each before the ack is sent. This enables strong consistency but is too slow!

Eventual durability: the write is acknowledged first, then lazily replicated and persisted (fsync) in the background. This is fast but enables only weak consistency due to data loss upon failures!
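
The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not code from the talk: `Node`, `write_immediate`, `write_eventual`, and `background_sync` are hypothetical names, `fsync` is simulated by copying items into a list, and the immediate path persists on all three nodes rather than just a majority.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    memory: list = field(default_factory=list)  # buffered state, lost on a crash
    disk: list = field(default_factory=list)    # persisted state, survives a crash

    def buffer(self, item):
        self.memory.append(item)

    def fsync(self):
        # Stand-in for a real fsync: persist everything buffered so far.
        for item in self.memory:
            if item not in self.disk:
                self.disk.append(item)

    def crash(self):
        # After a crash only what reached the disk is recovered.
        self.memory = list(self.disk)


def write_immediate(nodes, item):
    """Immediate durability: replicate and persist, then acknowledge."""
    for node in nodes:
        node.buffer(item)
        node.fsync()
    return "ack"


def write_eventual(nodes, item):
    """Eventual durability: buffer on one node and acknowledge right away."""
    nodes[0].buffer(item)
    return "ack"


def background_sync(nodes):
    """Lazy replication and persistence used by the eventual model."""
    for item in nodes[0].memory:
        for node in nodes:
            if item not in node.memory:
                node.buffer(item)
    for node in nodes:
        node.fsync()


def demo():
    strong = [Node() for _ in range(3)]
    write_immediate(strong, "a1")
    for node in strong:
        node.crash()
    survived_immediate = all("a1" in node.disk for node in strong)

    weak = [Node() for _ in range(3)]
    write_eventual(weak, "a1")
    for node in weak:
        node.crash()  # crash happens before background_sync has run
    survived_eventual = any("a1" in node.disk for node in weak)
    return survived_immediate, survived_eventual
```

In this sketch the acknowledged write survives a crash only in the immediate model; in the eventual model it is lost unless `background_sync` has run first.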

slide-25
SLIDE 25

Is it possible for a durability layer to enable both strong consistency and high performance?


slide-33
SLIDE 33

CAD: Consistency-aware Durability

Design the durability layer by taking the consistency model into account

Intuition: what a read sees is important for most consistency models

Key idea: CAD shifts the point of durability from writes to reads

data is replicated and persisted before a read is served
delayed writes → high performance
data durable before it is read → strong consistency even under failures
some writes may be lost if failures arise before they are read; still useful for the many systems that already use eventual durability

We show the efficacy of CAD by providing cross-client monotonic reads, a new and strong consistency property

slide-38
SLIDE 38

Results

ORCA: CAD and cross-client monotonic reads for leader-based systems, implemented in ZooKeeper

Compared to strongly consistent ZooKeeper

ORCA is 1.6–3.3x faster by using CAD
higher read throughput by allowing reads at many nodes
reduces latency in geo-distributed settings by 14x

Compared to weakly consistent ZooKeeper

ORCA provides similar throughput and latency but with stronger guarantees

We experimentally show ORCA's guarantees under failures, which are useful for applications

slide-39
SLIDE 39

Outline

Introduction
Motivation
CAD and cross-client monotonic reads
ORCA design
Results
Summary and conclusion

slide-53
SLIDE 53

Consistency Models and Guarantees

Example: I’m bored at FAST and want to go home!

Linearizability

latest data: no staleness
in-order reads across clients

Weaker models

stale reads
out-of-order reads across clients, even with monotonic reads and causal consistency

slide-64
SLIDE 64

Realizing Strong Consistency

Linearizability requires immediate durability

must synchronously replicate and persist data on a majority to tolerate failures

Poor performance due to synchronous operations

10x slower within a data center

[Figure: leader S1 and followers S2, S3 hold a0; update a1 is replicated and persisted before it is acknowledged. Legend: on disk, in memory; durable = on disk on a majority.]

slide-76
SLIDE 76

Realizing Weaker Models

Weaker models only require eventual durability

data buffered on one node, replication and persistence in background

out-of-order across clients: valid under causal and monotonic reads, but confusing semantics

Many deployments prefer eventual durability for performance

in fact, it is the default (e.g., MongoDB, Redis)

Thus, they settle for weak consistency

[Figure: servers S1, S2, S3 with item a0 and a buffered update a1, accessed by app session-1 and app session-2.]

slide-77
SLIDE 77

Immediate durability enables strong consistency but is slow

Eventual durability is fast but enables only weaker consistency

slide-93
SLIDE 93

Consistency-aware Durability

Most consistency models care about what reads see

Key idea: CAD shifts the point of durability from writes to reads

delay durability of writes: the client's write to S1, S2, S3 is acknowledged before it is durable → good performance

make data durable before serving reads → prevents out-of-order data across failures → strong consistency

CAD does not always incur overheads on reads

reads often do not immediately follow writes, which is natural in many workloads
common case: data is already durable well before applications access it

slide-103
SLIDE 103

Cross-client Monotonic Reads upon CAD

A read from a client is guaranteed to return at least the latest state returned to a previous read from any client

Even in the presence of failures and across client sessions

No existing model provides this guarantee except linearizability, but not with high performance

CAD enables this property with high performance

Does not prevent staleness, like many weaker models

However, avoids out-of-order data, useful in many app scenarios

e.g., location-sharing, Twitter timelines

[Figure: item history a0, a1, a2; client-1 reads a2, and a later read by client-2 also returns a2.]
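
The property can be made concrete with a small checker over a read history. This is a sketch, not part of the talk: it assumes states are numbered (a0 is 0, a1 is 1, a2 is 2) and that reads are listed in the order they completed; the function names are hypothetical. The second function checks ordinary per-client monotonic reads for contrast.

```python
def cross_client_violation(reads):
    """Return the first read that goes backwards relative to ANY earlier read.

    reads: list of (client, version) pairs in completion order.
    Returns None when the history satisfies cross-client monotonic reads.
    """
    latest = -1
    for client, version in reads:
        if version < latest:
            return (client, version)
        latest = max(latest, version)
    return None


def per_client_violation(reads):
    """Same check, but each client is only compared with its own earlier reads."""
    latest = {}
    for client, version in reads:
        if version < latest.get(client, -1):
            return (client, version)
        latest[client] = version
    return None


# client-1 sees a1, then client-2 sees the older a0.
out_of_order = [("client-1", 1), ("client-2", 0)]

# Both clients see a1 although a2 may already exist: stale, but in order.
stale_but_ordered = [("client-1", 1), ("client-2", 1), ("client-1", 2)]
```

The `out_of_order` history passes the per-client check yet fails the cross-client one, which is the gap the slide describes. The `stale_but_ordered` history passes both, since the property does not rule out staleness.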

slide-108
SLIDE 108

ORCA

Implementation of consistency-aware durability and cross-client monotonic reads in leader-based majority systems

Leader-based systems (e.g., MongoDB, ZooKeeper)

leader: a dedicated node
others are followers
writes flow through the leader, which establishes a single order

Majority

data is safe when persisted on a majority of nodes (e.g., 3 out of 5 servers)

slide-115
SLIDE 115

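ORCA Write Path

Same as an eventually durable system

replication and persistence in background

[Figure: leader S1 and followers S2, S3 hold a; a new update b arrives at the leader and is replicated and persisted in the background. Legend: on disk, in memory; durable = on disk on a majority.]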

slide-132
SLIDE 132

ORCA Read Path

Durable-index – index of the latest durable item in the system

Update-index of item i – index of the last update that modified i

Durability check – item i is durable if update-index of i ≤ durable-index of system

durable-index: 1, a’s update-index: 1 – a is durable, serve read immediately

durable-index: 1, b’s update-index: 2 – b is not durable, make b durable before serving

Legend: on disk; in memory; durable = on disk on a majority

[Diagram: copies of items a and b on servers S1, S2, S3; labels "leader – S1" and "leader – S2"]
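
The durability check on this slide can be sketched in a few lines of Python. This is a minimal, single-process illustration and not ORCA's implementation: the class `Leader`, its methods, and the `make_durable` stand-in (which only advances a counter instead of replicating and persisting on a majority) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    """Toy leader that tracks the two indexes named on the slide."""
    durable_index: int = 0   # index of the latest durable update
    last_index: int = 0      # index of the latest update accepted in memory
    update_index: dict = field(default_factory=dict)  # item -> index of last update to it
    values: dict = field(default_factory=dict)        # item -> latest value

    def write(self, item, value):
        # Accept the write in memory only; it is not durable yet.
        self.last_index += 1
        self.update_index[item] = self.last_index
        self.values[item] = value
        return self.last_index

    def make_durable(self, up_to):
        # Hypothetical stand-in for "persist on disk on a majority".
        self.durable_index = max(self.durable_index, up_to)

    def is_durable(self, item):
        # Durability check: update-index of item <= durable-index of system.
        return self.update_index[item] <= self.durable_index

    def read(self, item):
        # Serve durable items immediately; otherwise make the item durable first.
        forced = not self.is_durable(item)
        if forced:
            self.make_durable(self.update_index[item])
        return self.values[item], forced


leader = Leader()
leader.make_durable(leader.write("a", "va"))  # a: update-index 1, durable-index 1
leader.write("b", "vb")                        # b: update-index 2, still only in memory

print(leader.read("a"))  # a is durable, served immediately
print(leader.read("b"))  # b is made durable before it is served
```

In this sketch only the read of `b` pays the cost of synchronous durability, which mirrors the two cases shown on the slide.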

slide-147
SLIDE 147

Cross-Client Monotonic Reads in ORCA

If reads are restricted to the leader, CAD provides cross-client monotonic reads

but this is not scalable

Allow reads at followers

lagging followers could cause out-of-order states, so CAD alone is not sufficient

[Diagram: servers S1 (leader), S2, S3, S4, S5 holding copies of a1 and a2]

Additional mechanisms: active sets (a lease-based mechanism), not covered in this talk…
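
To make the lagging-follower problem concrete, here is a small Python sketch. It is an illustration under stated assumptions, not part of ORCA: the placement of versions (S1 to S3 hold a2, S4 and S5 still hold a1) and the helpers `read_sequence` and `is_monotonic` are invented for the example. It does not model active sets, which the talk does not describe.

```python
# Assumed state: version a2 is durable on a majority (S1, S2, S3),
# while followers S4 and S5 lag behind and still hold version a1.
replicas = {"S1": 2, "S2": 2, "S3": 2, "S4": 1, "S5": 1}


def read_sequence(servers):
    """Versions of item a seen by successive reads, one read per listed server."""
    return [replicas[s] for s in servers]


def is_monotonic(observed):
    """True if no read returns an older version than an earlier read."""
    return all(earlier <= later for earlier, later in zip(observed, observed[1:]))


# Two clients reading one after the other, both at the leader.
leader_only = read_sequence(["S1", "S1"])

# First client reads at the leader, second client reads at a lagging follower.
with_follower = read_sequence(["S1", "S5"])

print(leader_only, is_monotonic(leader_only))      # [2, 2] True
print(with_follower, is_monotonic(with_follower))  # [2, 1] False
```

The second sequence goes backwards from a2 to a1 even though a2 was durable when it was first served, which is why the durability check by itself does not give cross-client monotonic reads once followers serve reads.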

slide-148
SLIDE 148

Outline

Introduction

Motivation

CAD and cross-client monotonic reads

ORCA design

Results

Summary and conclusion

slide-152
SLIDE 152

Evaluation

Implemented in ZooKeeper

Evaluate different durability models in isolation

compare CAD against immediate and eventual durability

Evaluate overall system performance

compare ORCA against strongly and weakly consistent ZooKeeper

slide-162
SLIDE 162

CAD Durability Layer Performance

YCSB-A: 50% W, 50% R

[Figure: Write Latency Distribution – CDF vs. latency (us) for immediate, eventual, and cad]

CAD writes are faster than with immediate durability

CAD matches the performance of eventual durability

[Figure: Read Latency Distribution – CDF vs. latency (us) for immediate, eventual, and cad; annotations: "reads queued behind writes", "5%"]

Most reads in CAD are fast

Only 5% are slow due to synchronous ops

CAD performs similarly to eventual durability and is faster than immediate durability

slide-168
SLIDE 168

ORCA System Performance

Strong-ZK – uses immediate durability, reads only at leader

Weak-ZK – uses eventual durability, reads at many nodes

ORCA – uses CAD, reads at many nodes

[Figure: Throughput (KOps/s) of strong-ZK, weak-ZK, and orca on workloads A, B, D, F; bar labels as extracted: 3.78 2.09 2.09 3.44 3.28 1.97 1.75 3.04]

Workload   Reads   Writes
A          50%     50%
B          95%     5%
D          95%     5%
F          66.7%   33.3%

Strong-ZK performs poorly due to immediate durability and leader-restricted reads

Weak-ZK performs well due to eventual durability and scalable reads

ORCA adds little overhead compared to weak-ZK

reads that access non-durable data

slide-169
SLIDE 169

More experiments in the paper…

Evaluation

correctness testing using a cluster crash-testing framework

geo-replicated setting

micro-benchmarks

Application case studies

location-tracking

social-media timeline

slide-177
SLIDE 177

Summary and Conclusions

Surprisingly, durability models are overlooked

Immediate durability enables strong consistency but is slow

Eventual durability is fast but only enables weak consistency

CAD – consistency-aware durability, a new way of thinking about durability

enables both strong consistency and high performance

CAD is useful for many deployments that currently adopt eventual durability

Consistency and performance are seemingly at odds – by carefully examining the underlying layer, we can achieve both

Thank you!