SLIDE 1

RETHINKING END-TO-END RELIABILITY IN CLOUD STORAGE SYSTEMS

Amy Tai, Andrew Kryczka, Shobhit Kanaujia, Kyle Jamieson, Michael J. Freedman, Asaf Cidon
To appear in USENIX ATC 2019

SLIDE 2

Denser flash → shorter lifetime

[Figure: error rate vs. number of writes for SLC, MLC, and TLC flash; denser flash crosses the acceptable error rate after fewer writes, so TLC lifetime < MLC lifetime < SLC lifetime.]

Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.

SLIDE 3

Shorter flash lifetimes are a problem

  • Datacenter operators must closely monitor flash writes
  • Memory : flash cost ratio is increasing → workloads moving from DRAM to flash → increases pressure on flash
  • Datacenters struggling to adopt future generations of flash (e.g., QLC)

How can we increase flash lifetimes?

SLIDE 4

Increasing the acceptable error rate → increased lifetimes

[Figure: the same error rate vs. number of writes plot, with the acceptable error rate raised; TLC lifetime extends because the higher threshold is crossed after more writes.]

Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.

SLIDE 5

But… hardware is expected to have low error rates

  • Software is designed assuming bit errors are rare
  • Bit errors cause failed operations and reduced availability
  • The error-handling path is not performant

SLIDE 6

Distributed error Isolation and RECovery Techniques (DIRECT)

  • 1. Use distributed redundancy to fix local bit errors
     → distributed systems need redundant copies for availability
  • 2. Optimize error-recovery performance

→ flash devices can expose high error rates → flash devices have longer lifetimes → cheaper flash devices (QLC and beyond)
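The first idea can be sketched in a few lines of Python. Everything here is illustrative (the checksummed store layout and the replica interface are assumptions, not the paper's API): on a local checksum mismatch, repair the object from a remote replica instead of failing the operation or the node.

```python
import zlib

def checksummed(blob):
    """Store a value together with its CRC32 so bit errors are detectable."""
    return (blob, zlib.crc32(blob))

def read_with_direct(local, replicas, key):
    """Read `key` locally; on a bit error, repair from a replica in place."""
    blob, crc = local[key]
    if zlib.crc32(blob) == crc:
        return blob                         # common case: no bit error
    for replica in replicas:                # use distributed redundancy
        cand, cand_crc = replica.get(key, (None, None))
        if cand is not None and zlib.crc32(cand) == cand_crc:
            local[key] = (cand, cand_crc)   # fix the corrupted local copy
            return cand
    raise IOError(f"unrecoverable bit error for key {key!r}")
```

With one healthy replica, a flipped bit in the local copy becomes a transparent read-repair rather than a failed operation or a crashed server.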

SLIDE 7

Bit errors in the storage stack…

[Diagram: three replicated nodes, each stacking unreliable flash, a hardened file system (e.g., ZFS), and a local data store, all under a Distributed Coordination / Replication Layer.]

SLIDE 8

… can manifest in the file system

[Diagram: the same stack, with a bit error shown in one node's file system layer.]

Errors in File System:

  • File system metadata (inodes, etc.)
  • File system data (data blocks)
SLIDE 9

…or in the local data store

[Diagram: the same stack, with the bit error now in one node's local data store.]

Errors in the local data store:

  • Application metadata or data (e.g., in RocksDB)

SLIDE 10

…and need to be dealt with in the coordination layer

[Diagram: the same stack; errors surface as corrupt application metadata or data, and correct recovery must go through the Distributed Coordination / Replication Layer (e.g., Paxos / ZooKeeper).]

SLIDE 11

DIRECT corrects bit errors in the local data store

[Diagram: the same three-node stack, with DIRECT added inside each node's local data store.]

SLIDE 12

Local data store errors: metadata

[Diagram: each local data store holds local metadata (version number, server ID, index, etc.) and data objects; a bit error (X) strikes one node's local metadata.]

SLIDE 13

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
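Because local metadata is small, this first step can be sketched very simply (Python; the fixed copy count and layout are assumptions for illustration, not the actual on-disk format): write several checksummed copies and read back the first one that still validates.

```python
import zlib

COPIES = 3  # illustrative replication factor for in-line metadata

def write_metadata(dev, blob):
    """Local metadata is tiny, so keep several checksummed copies of it."""
    dev[:] = [(bytearray(blob), zlib.crc32(blob)) for _ in range(COPIES)]

def read_metadata(dev):
    """Return the first copy whose checksum still validates."""
    for blob, crc in dev:
        if zlib.crc32(bytes(blob)) == crc:
            return bytes(blob)
    raise IOError("all local metadata copies are corrupt")
```

A bit flip in one copy is then invisible to readers, with no network round trip at all.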

SLIDE 14

Local data store errors: data

[Diagram: the same layout; the bit error (X) now strikes a data object in one node's local data store.]

SLIDE 15

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas

SLIDE 16

Optimizing error recovery: strawman treats bit errors as unavailability events

[Diagram: on a bit error (X), the strawman copies the entire node from its replicas, which is prohibitively slow.]

SLIDE 17

Optimizing error recovery: strawman treats bit errors as unavailability events

[Diagram: the same strawman; copying the entire node is prohibitively slow.]

How to isolate data necessary for recovery?

SLIDE 18

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems

SLIDE 19

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery

SLIDE 20

Naïve recovery protocol

[Diagram: object A is replicated on each node; a bit error (X) corrupts one node's replica of A.]

SLIDE 21

Naïve recovery protocol

[Diagram: the corrupted node issues a recovery request for A while a concurrent write operation to A is in flight.]

SLIDE 22

Naïve recovery protocol

[Diagram: the write operation updates the healthy replicas to A′ while the recovering node fetches the old value A.]

SLIDE 23

Naïve recovery protocol: inconsistency

[Diagram: recovery completes with the stale value A while the other replicas hold A′, an inconsistency.]

SLIDE 24

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery
     → with respect to the system's consistency guarantees
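The safety requirement can be sketched as a version check (Python; the version-number guard is only an illustration of the general requirement, not the paper's exact protocol): a patch fetched from a replica may race with a concurrent write, so it is applied only if nothing newer has landed in the meantime.

```python
def apply_patch(store, key, patched_value, patch_version):
    """Apply a recovered value only if no newer write has landed.

    The naive protocol overwrites unconditionally, which can clobber a
    write that committed while recovery was in flight; comparing the
    patch's version against the live one keeps recovery consistent."""
    current_version, _ = store.get(key, (-1, None))
    if patch_version >= current_version:
        store[key] = (patch_version, patched_value)
        return True
    return False  # a newer write won the race; drop the stale patch
```

This is the essence of routing recovery through the coordination layer: the patch is ordered against concurrent writes instead of bypassing them.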

SLIDE 25

Implementations of DIRECT

  • ZippyDB / RocksDB
     → RocksDB: KV store backed by a log-structured merge tree
     → ZippyDB: distributed KV store backed by RocksDB
  • HDFS: block-level distributed file system

SLIDE 26

ZippyDB Overview

[Diagram: a ZippyDB shard: a Primary and two Secondaries, each backed by its own RocksDB instance, serving write requests.]

SLIDE 27

ZippyDB Overview

[Diagram: the same shard, annotated: ZippyDB is the coordination layer; RocksDB is the local data store.]

SLIDE 28

How ZippyDB handles corruptions

  • User reads: retry from another server
  • Background reads (compaction): crash the server
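This baseline can be sketched as follows (Python; the helper names and the RuntimeError standing in for a server crash are illustrative, not ZippyDB's actual code): a corrupt block found by a user read is re-fetched elsewhere, but a corrupt block found during compaction has no client to redirect, so the server gives up.

```python
import zlib

def baseline_read(block, crc, other_servers, *, background=False):
    """Pre-DIRECT corruption handling (sketch): user reads retry from
    another server; background (compaction) reads crash the server."""
    if zlib.crc32(block) == crc:
        return block                       # checksum ok: normal path
    if background:
        # No client to answer; the node is taken down and re-replicated,
        # an expensive unavailability event.
        raise RuntimeError("corruption during compaction: crashing server")
    for fetch in other_servers:            # user read: retry elsewhere
        candidate = fetch()
        if zlib.crc32(candidate) == crc:
            return candidate
    raise IOError("no server holds an uncorrupted copy")
```

The crash-on-compaction path is exactly what DIRECT replaces with targeted, in-place recovery.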

SLIDE 29

ZippyDB-DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery
     → with respect to the system's consistency guarantees

SLIDE 30

RocksDB SST file layout

[Figure: SST file layout: Data block 1 … Data block N, Metadata block 1, Metadata block 2, Index block, footer.]

SLIDE 31

ZippyDB-DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery
     → with respect to the system's consistency guarantees

SLIDE 32

Identifying corrupt data

No way of knowing the exact key-value pair!

[Figure: the SST layout with a bit error (X) inside one data block.]

SLIDE 33

Index blocks identify corrupt range

No way of knowing the exact key-value pair! Instead… use index entries to get a range.

[Figure: the SST layout; the index block's entries bracket the corrupt data block.]

SLIDE 34

Index blocks identify corrupt range

Use index entries to get a range around the corrupt pair (e.g., <key123, value456>), then fetch all keys in the range [key1, key2] from another replica.

[Figure: the SST layout; index entries key1 and key2 bound the corrupt data block.]
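The range derivation can be sketched as follows (Python, over a simplified index: a sorted list of each data block's first key; RocksDB's index actually stores separator keys, so this layout is an assumption for illustration). The corrupt block's index entry and its successor bound the keys that must be re-fetched, so recovery touches a single block's worth of data rather than the whole node.

```python
def corrupt_key_range(index_keys, corrupt_block_no):
    """Given each data block's first key (sorted) and the ordinal of the
    corrupt block, return the key range that must be re-fetched."""
    lo = index_keys[corrupt_block_no]
    hi = (index_keys[corrupt_block_no + 1]
          if corrupt_block_no + 1 < len(index_keys) else None)
    return lo, hi  # keys in [lo, hi) may be corrupt; hi=None means "to end"

def recover_range(replica, lo, hi):
    """Fetch only the keys in the corrupt range from a healthy replica."""
    return {k: v for k, v in replica.items()
            if k >= lo and (hi is None or k < hi)}
```

Because the index block survives, the recovered range is tight: everything outside it is known to be intact.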

SLIDE 35

ZippyDB-DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery
     → with respect to the system's consistency guarantees

SLIDE 36

Safe recovery of corrupted data objects

[Diagram: (1) a Secondary's RocksDB hits a bit error (X); (2) it computes the corrupted key range; (3) it sends a patch request to the Primary's ZippyDB; (4) the Primary forwards patch requests to the replicas.]

SLIDE 37

Safe recovery of corrupted data objects

[Diagram, continued: steps (4) and (5) fan patch requests out to the replicas; in step (6) the patches flow back and the recovering RocksDB applies them.]

SLIDE 38

ZippyDB-DIRECT Summary

  • 1. Replicate metadata blocks in-line
  • 2. Recover data blocks by recovering their key range
  • 3. Run recovery through the coordination layer for safety

SLIDE 39

ZippyDB-DIRECT Evaluation

  • 1. How much higher an error rate can ZippyDB/RocksDB tolerate with DIRECT?
  • 2. How much faster is recovery from bit corruptions in ZippyDB/RocksDB with DIRECT?

SLIDE 40

Experimental Setup

  • 60 Facebook servers
  • Traffic duplicated from live servers
  • Error injection done on the fly in RocksDB


SLIDE 41

Error rate tolerance

[Chart: ZippyDB vs. ZippyDB-DIRECT at ~2600 QPS across uncorrectable bit error rates (UBER).]

UBER           | 10^-10 | 10^-11 | 10^-12  | 10^-13
ZippyDB        | 2.731% | 1.981% | 0.265%  | 0.011%
ZippyDB-DIRECT | 0.187% | 0.040% | 0.0008% | 0.0002%

SLIDE 42

DIRECT takes milliseconds to recover, while ZippyDB takes minutes


SLIDE 43

Conclusion

  • With existing distributed redundancy, DIRECT enables storage systems to:
     → tolerate 100,000x higher bit error rates
     → reduce recovery time from bit errors by orders of magnitude
  • Two novel mechanisms:
     → minimize the size of the recovered data
     → leverage the coordination layer for safe recovery
  • Consequently, extend flash lifetime by allowing devices to expose bit errors