SLIDE 1

RETHINKING END-TO-END RELIABILITY IN CLOUD STORAGE SYSTEMS

Amy Tai, Andrew Kryczka, Shobhit Kanaujia, Kyle Jamieson, Michael J. Freedman, Asaf Cidon
To appear in USENIX ATC 2019

SLIDE 2

Denser flash → shorter lifetime

[Figure: error rate vs. number of writes for SLC, MLC, and TLC flash; denser flash crosses the acceptable error rate after fewer writes, so TLC lifetime < MLC lifetime < SLC lifetime.]

Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.

SLIDE 3

Shorter flash lifetimes are a problem

  • Datacenter operators must closely monitor flash writes
  • Memory : flash cost ratio is increasing → workloads moving from DRAM to flash → increases pressure on flash
  • Datacenters struggling to adopt future generations of flash (e.g., QLC)

How can we increase flash lifetimes?

SLIDE 4

Increasing the acceptable error rate → increased lifetimes

[Figure: the same error rate vs. number of writes plot, with the acceptable error rate raised; TLC lifetime extends because the higher threshold is crossed after more writes.]

Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.

SLIDE 5

But… hardware is expected to have low error rates

  • Software is designed assuming bit errors are rare
  • Bit errors cause failed operations and reduced availability
  • The error-handling path is not performant

SLIDE 6

Distributed error Isolation and RECovery Techniques (DIRECT)

  • 1. Use distributed redundancy to fix local bit errors
     → distributed systems need redundant copies for availability
  • 2. Optimize error-recovery performance

→ flash devices can expose high error rates → flash devices have longer lifetimes → cheaper flash devices (QLC and beyond)
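The first idea can be sketched in a few lines of Python. Everything here is illustrative (the checksummed store layout and the replica interface are assumptions, not the paper's API): on a local checksum mismatch, repair the object from a remote replica instead of failing the operation or the node.

```python
import zlib

def checksummed(blob):
    """Store a value together with its CRC32 so bit errors are detectable."""
    return (blob, zlib.crc32(blob))

def read_with_direct(local, replicas, key):
    """Read `key` locally; on a bit error, repair from a replica in place."""
    blob, crc = local[key]
    if zlib.crc32(blob) == crc:
        return blob                         # common case: no bit error
    for replica in replicas:                # use distributed redundancy
        cand, cand_crc = replica.get(key, (None, None))
        if cand is not None and zlib.crc32(cand) == cand_crc:
            local[key] = (cand, cand_crc)   # fix the corrupted local copy
            return cand
    raise IOError(f"unrecoverable bit error for key {key!r}")
```

With one healthy replica, a flipped bit in the local copy becomes a transparent read-repair rather than a failed operation or a crashed server.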

SLIDE 7

Bit errors in the storage stack…

[Diagram: three replicated nodes, each stacking unreliable flash, a hardened file system (e.g., ZFS), and a local data store, all under a Distributed Coordination / Replication Layer.]

SLIDE 8

… can manifest in the file system

[Diagram: the same stack, with a bit error shown in one node's file system layer.]

Errors in File System:

  • File system metadata (inodes, etc.)
  • File system data (data blocks)
SLIDE 9

…or in the local data store

[Diagram: the same stack, with the bit error now in one node's local data store.]

Errors in the local data store:

  • Application metadata or data (e.g., in RocksDB)

SLIDE 10

…and need to be dealt with in the coordination layer

[Diagram: the same stack; errors surface as corrupt application metadata or data, and correct recovery must go through the Distributed Coordination / Replication Layer (e.g., Paxos / ZooKeeper).]

SLIDE 11

DIRECT corrects bit errors in the local data store

[Diagram: the same three-node stack, with DIRECT added inside each node's local data store.]

SLIDE 12

Local data store errors: metadata

[Diagram: each local data store holds local metadata (version number, server ID, index, etc.) and data objects; a bit error (X) strikes one node's local metadata.]

SLIDE 13

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
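Because local metadata is small, this first step can be sketched very simply (Python; the fixed copy count and layout are assumptions for illustration, not the actual on-disk format): write several checksummed copies and read back the first one that still validates.

```python
import zlib

COPIES = 3  # illustrative replication factor for in-line metadata

def write_metadata(dev, blob):
    """Local metadata is tiny, so keep several checksummed copies of it."""
    dev[:] = [(bytearray(blob), zlib.crc32(blob)) for _ in range(COPIES)]

def read_metadata(dev):
    """Return the first copy whose checksum still validates."""
    for blob, crc in dev:
        if zlib.crc32(bytes(blob)) == crc:
            return bytes(blob)
    raise IOError("all local metadata copies are corrupt")
```

A bit flip in one copy is then invisible to readers, with no network round trip at all.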

SLIDE 14

Local data store errors: data

[Diagram: the same layout; the bit error (X) now strikes a data object in one node's local data store.]

SLIDE 15

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas

SLIDE 16

Optimizing error recovery: strawman treats bit errors as unavailability events

[Diagram: on a bit error (X), the strawman copies the entire node from its replicas, which is prohibitively slow.]

SLIDE 17

Optimizing error recovery: strawman treats bit errors as unavailability events

[Diagram: the same strawman; copying the entire node is prohibitively slow.]

How to isolate data necessary for recovery?

SLIDE 18

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems

SLIDE 19

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery

SLIDE 20

Naïve recovery protocol

[Diagram: object A is replicated on each node; a bit error (X) corrupts one node's replica of A.]

SLIDE 21

Naïve recovery protocol

[Diagram: the corrupted node issues a recovery request for A while a concurrent write operation to A is in flight.]

SLIDE 22

Naïve recovery protocol

[Diagram: the write operation updates the healthy replicas to A′ while the recovering node fetches the old value A.]

SLIDE 23

Naïve recovery protocol: inconsistency

[Diagram: recovery completes with the stale value A while the other replicas hold A′, an inconsistency.]

SLIDE 24

DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery
     → with respect to the system's consistency guarantees
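The safety requirement can be sketched as a version check (Python; the version-number guard is only an illustration of the general requirement, not the paper's exact protocol): a patch fetched from a replica may race with a concurrent write, so it is applied only if nothing newer has landed in the meantime.

```python
def apply_patch(store, key, patched_value, patch_version):
    """Apply a recovered value only if no newer write has landed.

    The naive protocol overwrites unconditionally, which can clobber a
    write that committed while recovery was in flight; comparing the
    patch's version against the live one keeps recovery consistent."""
    current_version, _ = store.get(key, (-1, None))
    if patch_version >= current_version:
        store[key] = (patch_version, patched_value)
        return True
    return False  # a newer write won the race; drop the stale patch
```

This is the essence of routing recovery through the coordination layer: the patch is ordered against concurrent writes instead of bypassing them.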

SLIDE 25

Implementations of DIRECT

  • ZippyDB / RocksDB
     → RocksDB: KV store backed by a log-structured merge tree
     → ZippyDB: distributed KV store backed by RocksDB
  • HDFS: block-level distributed file system

SLIDE 26

ZippyDB Overview

[Diagram: a ZippyDB shard: a Primary and two Secondaries, each backed by its own RocksDB instance, serving write requests.]

SLIDE 27

ZippyDB Overview

[Diagram: the same shard, annotated: ZippyDB is the coordination layer; RocksDB is the local data store.]

SLIDE 28

How ZippyDB handles corruptions

  • User reads: retry from another server
  • Background reads (compaction): crash the server
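This baseline can be sketched as follows (Python; the helper names and the RuntimeError standing in for a server crash are illustrative, not ZippyDB's actual code): a corrupt block found by a user read is re-fetched elsewhere, but a corrupt block found during compaction has no client to redirect, so the server gives up.

```python
import zlib

def baseline_read(block, crc, other_servers, *, background=False):
    """Pre-DIRECT corruption handling (sketch): user reads retry from
    another server; background (compaction) reads crash the server."""
    if zlib.crc32(block) == crc:
        return block                       # checksum ok: normal path
    if background:
        # No client to answer; the node is taken down and re-replicated,
        # an expensive unavailability event.
        raise RuntimeError("corruption during compaction: crashing server")
    for fetch in other_servers:            # user read: retry elsewhere
        candidate = fetch()
        if zlib.crc32(candidate) == crc:
            return candidate
    raise IOError("no server holds an uncorrupted copy")
```

The crash-on-compaction path is exactly what DIRECT replaces with targeted, in-place recovery.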

SLIDE 29

ZippyDB-DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery
     → with respect to the system's consistency guarantees

SLIDE 30

RocksDB SST file layout

[Figure: SST file layout: Data block 1 … Data block N, Metadata block 1, Metadata block 2, Index block, footer.]

SLIDE 31

ZippyDB-DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery
     → with respect to the system's consistency guarantees

SLIDE 32

Identifying corrupt data

No way of knowing the exact key-value pair!

[Figure: the SST layout with a bit error (X) inside one data block.]

SLIDE 33

Index blocks identify corrupt range

No way of knowing the exact key-value pair! Instead… use index entries to get a range.

[Figure: the SST layout; the index block's entries bracket the corrupt data block.]

SLIDE 34

Index blocks identify corrupt range

Use index entries to get a range around the corrupt pair (e.g., <key123, value456>), then fetch all keys in the range [key1, key2] from another replica.

[Figure: the SST layout; index entries key1 and key2 bound the corrupt data block.]
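The range derivation can be sketched as follows (Python, over a simplified index: a sorted list of each data block's first key; RocksDB's index actually stores separator keys, so this layout is an assumption for illustration). The corrupt block's index entry and its successor bound the keys that must be re-fetched, so recovery touches a single block's worth of data rather than the whole node.

```python
def corrupt_key_range(index_keys, corrupt_block_no):
    """Given each data block's first key (sorted) and the ordinal of the
    corrupt block, return the key range that must be re-fetched."""
    lo = index_keys[corrupt_block_no]
    hi = (index_keys[corrupt_block_no + 1]
          if corrupt_block_no + 1 < len(index_keys) else None)
    return lo, hi  # keys in [lo, hi) may be corrupt; hi=None means "to end"

def recover_range(replica, lo, hi):
    """Fetch only the keys in the corrupt range from a healthy replica."""
    return {k: v for k, v in replica.items()
            if k >= lo and (hi is None or k < hi)}
```

Because the index block survives, the recovered range is tight: everything outside it is known to be intact.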

SLIDE 35

ZippyDB-DIRECT

  • 1. Protect and fix errors in local metadata
     → with local replication of metadata
  • 2. Fix errors in data objects with replicas
     → minimize the amount of data required from other replicas
     → challenging in logically-replicated systems
  • 3. Safe recovery
     → with respect to the system's consistency guarantees

SLIDE 36

Safe recovery of corrupted data objects

[Diagram: (1) a Secondary's RocksDB hits a bit error (X); (2) it computes the corrupted key range; (3) it sends a patch request to the Primary's ZippyDB; (4) the Primary forwards patch requests to the replicas.]

SLIDE 37

Safe recovery of corrupted data objects

[Diagram, continued: steps (4) and (5) fan patch requests out to the replicas; in step (6) the patches flow back and the recovering RocksDB applies them.]

SLIDE 38

ZippyDB-DIRECT Summary

  • 1. Replicate metadata blocks in-line
  • 2. Recover data blocks by recovering their key range
  • 3. Run recovery through the coordination layer for safety

SLIDE 39

ZippyDB-DIRECT Evaluation

  • 1. How much higher an error rate can ZippyDB/RocksDB tolerate with DIRECT?
  • 2. How much faster is recovery from bit corruptions in ZippyDB/RocksDB with DIRECT?

SLIDE 40

Experimental Setup

  • 60 Facebook servers
  • Traffic duplicated from live servers
  • Error injection done on the fly in RocksDB


SLIDE 41

Error rate tolerance

[Chart: ZippyDB vs. ZippyDB-DIRECT at ~2600 QPS across uncorrectable bit error rates (UBER).]

UBER           | 10^-10 | 10^-11 | 10^-12  | 10^-13
ZippyDB        | 2.731% | 1.981% | 0.265%  | 0.011%
ZippyDB-DIRECT | 0.187% | 0.040% | 0.0008% | 0.0002%

SLIDE 42

DIRECT takes milliseconds to recover, while ZippyDB takes minutes


SLIDE 43

Conclusion

  • With existing distributed redundancy, DIRECT enables storage systems to:
     → tolerate 100,000x higher bit error rates
     → reduce recovery time from bit errors by orders of magnitude
  • Two novel mechanisms:
     → minimize the size of the recovered data
     → leverage the coordination layer for safe recovery
  • Consequently, extend flash lifetime by allowing devices to expose bit errors