RETHINKING END-TO-END RELIABILITY IN CLOUD STORAGE SYSTEMS
Amy Tai, Andrew Kryczka, Shobhit Kanaujia, Kyle Jamieson, Michael J. Freedman, Asaf Cidon
To appear in USENIX ATC 2019
Denser flash → shorter lifetime

[Figure: error rate vs. number of writes for SLC, MLC, and TLC flash. Denser flash (SLC → MLC → TLC) crosses the acceptable error rate after fewer writes, so its usable lifetime is shorter.]
Source: Novotný, R., J. Kadlec, and R. Kuchta. "NAND Flash Memory Organization and Operations." Journal of Information Technology & Software Engineering 5.1 (2015): 1.
Shorter flash lifetimes are a problem
- Datacenter operators must closely monitor flash writes
- The memory-to-flash cost ratio is increasing, so workloads are moving from DRAM to flash, which increases write pressure on flash
- Datacenters are struggling to adopt future generations of flash (e.g., QLC)
How can we increase flash lifetimes?
Raising the acceptable error rate increases lifetimes.

[Figure: the same error rate vs. writes plot, with a higher acceptable error rate line extending the TLC lifetime.]
But... hardware is expected to have low error rates
- Software is designed assuming bit errors are rare
- Bit errors cause failed operations and reduced availability
- The error-handling path is not performant
Distributed error Isolation and RECovery Techniques (DIRECT)
- 1. Use distributed redundancy to fix local bit errors
- Distributed systems already keep redundant copies for availability
- 2. Optimize error-recovery performance
Flash devices can expose higher error rates → flash devices have longer lifetimes → cheaper flash devices (QLC and beyond)
Bit errors in the storage stack…

[Diagram: a Distributed Coordination / Replication Layer sits above several nodes; each node stacks a local data store on a hardened file system (e.g., ZFS) on unreliable flash.]
… can manifest in the file system

[Same storage-stack diagram.]
Errors in the file system:
- File system metadata (inodes, etc.)
- File system data (data blocks)
…or in the local data store

[Same storage-stack diagram.]
Errors in the file system:
- File system metadata (inodes, etc.)
- File system data (data blocks)
Errors in the local data store (e.g., RocksDB):
- Application metadata or data
…and need to be dealt with in the coordination layer

[Same storage-stack diagram.]
Errors in the file system:
- File system metadata (inodes, etc.)
- File system data (data blocks)
Errors in the local data store:
- Application metadata or data
The coordination layer (Paxos / ZooKeeper) must ensure correct recovery.
DIRECT corrects bit errors in the local data store

[Diagram: the same stack, with DIRECT embedded in each node's local data store beneath the Distributed Coordination / Replication Layer.]
Local data store errors: metadata

[Diagram: under the Distributed Coordination / Replication Layer, each DIRECT local data store holds local metadata (version number, server ID, index, etc.) and data objects; an X marks a bit error in one store's local metadata.]

DIRECT
- 1. Protect and fix errors in local metadata
With local replication of metadata
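The metadata-protection step above can be sketched as a store that writes each metadata record twice, each copy guarded by a checksum, and heals one copy from the other on read. This is a minimal illustration of the idea; the class, layout, and names are our own, not ZippyDB's actual format.

```python
import zlib

class ReplicatedMetadata:
    """Keep two local copies of each metadata record, each with a CRC32
    checksum, so a bit error in one copy can be fixed from the other."""

    def __init__(self):
        self.copies = [{}, {}]  # two independent copies: key -> (crc, bytes)

    def put(self, key, value: bytes):
        record = (zlib.crc32(value), value)
        for copy in self.copies:
            copy[key] = record

    def get(self, key) -> bytes:
        # Return the first copy whose checksum verifies; repair the sibling.
        for i, copy in enumerate(self.copies):
            crc, value = copy[key]
            if zlib.crc32(value) == crc:
                self.copies[1 - i][key] = (crc, value)  # heal the other copy
                return value
        raise IOError("both local metadata copies are corrupt")

meta = ReplicatedMetadata()
meta.put("version", b"42")
meta.copies[0]["version"] = (0, b"\x00\x00")  # simulate a bit error in copy 0
assert meta.get("version") == b"42"           # read still succeeds
assert meta.copies[0]["version"][1] == b"42"  # corrupt copy was repaired
```

Because both copies live on the same device, this only protects against localized bit errors, which is exactly the failure mode DIRECT targets; whole-device loss is still handled by distributed replication.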
Local data store errors: data

[Diagram: same as before, but the X marks a bit error in one store's data objects.]

DIRECT
- 1. Protect and fix errors in local metadata
With local replication of metadata
- 2. Fix errors in data objects with replicas
Optimizing error recovery: a strawman treats bit errors as unavailability events

[Diagram: a bit error (X) hits one replica's store; the strawman response is to copy the entire node from the other replicas through the Distributed Coordination / Replication Layer, which is prohibitively slow.]

How do we isolate the data necessary for recovery?
DIRECT
- 1. Protect and fix errors in local metadata
With local replication of metadata
- 2. Fix errors in data objects with replicas
Minimize the amount of data required from other replicas
Challenging in logically-replicated systems
- 3. Safe recovery
Naïve recovery protocol: inconsistency

[Diagram sequence: an object A is replicated across the local data stores, and a bit error (X) corrupts one replica's copy of A. The corrupted store sends a recovery request for A to another replica. Concurrently, a write operation updates A to A′ on the healthy replicas. When the stale recovery response arrives, the recovering store installs A while its peers hold A′, leaving the replicas inconsistent.]
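The race in the slide above can be reproduced in a toy timeline: a replica that recovers by copying an object directly, without ordering the recovery against concurrent writes, can reinstall a value that a newer write has already superseded. Values and structure here are purely illustrative.

```python
# Toy timeline of the naive recovery race.
replicas = [{"A": "v1"}, {"A": "v1"}, {"A": "v1"}]

# Replica 0 detects a bit error in its copy of A and reads A from
# replica 1, but the recovery response is still "in flight".
recovered_value = replicas[1]["A"]

# Meanwhile a write updates A to v2 on the healthy replicas.
for r in replicas[1:]:
    r["A"] = "v2"

# The stale recovery response finally arrives and is applied blindly.
replicas[0]["A"] = recovered_value

# Replica 0 now disagrees with its peers: the replicas are inconsistent.
assert replicas[0]["A"] != replicas[1]["A"]
```

This is why DIRECT's third principle routes recovery through the coordination layer, so recovered values are ordered with respect to concurrent writes.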
DIRECT
- 1. Protect and fix errors in local metadata
With local replication of metadata
- 2. Fix errors in data objects with replicas
Minimize the amount of data required from other replicas
Challenging in logically-replicated systems
- 3. Safe recovery
With respect to the system’s consistency guarantees
Implementations of DIRECT
- ZippyDB/RocksDB
- RocksDB: KV store backed by a log-structured merge tree
- ZippyDB: distributed KV store backed by RocksDB
- HDFS: block-level distributed file system
ZippyDB Overview

[Diagram: write requests flow through ZippyDB to a primary and two secondaries; each server runs RocksDB as its local data store.]
ZippyDB = coordination layer; RocksDB = local data store
How ZippyDB handles corruptions
- User reads: retry from another server
- Background reads (compaction): crash the server
ZippyDB-DIRECT
- 1. Protect and fix errors in local metadata
With local replication of metadata
- 2. Fix errors in data objects with replicas
Minimize the amount of data required from other replicas
Challenging in logically-replicated systems
- 3. Safe recovery
With respect to the system’s consistency guarantees
RocksDB SST file layout

[Diagram: an SST file consists of data block 1 … data block N, followed by metadata blocks 1 and 2, an index block, and a footer.]
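The block/index/footer structure above can be pictured with a toy packer and reader: data blocks first, then an index of (offset, length, checksum) entries, then a footer that points at the index. This is a much-simplified illustration of the shape of such a file, not RocksDB's real on-disk format.

```python
import struct
import zlib

def build_sst(blocks):
    """Pack data blocks, then an index of (offset, length, crc) entries,
    then a footer pointing at the index -- a toy version of the layout."""
    out = bytearray()
    index = []
    for b in blocks:
        index.append((len(out), len(b), zlib.crc32(b)))
        out += b
    index_off = len(out)
    for off, length, crc in index:
        out += struct.pack("<QQI", off, length, crc)  # 20-byte index entry
    out += struct.pack("<QQ", index_off, len(index))  # 16-byte footer
    return bytes(out)

def read_block(sst, i):
    """Locate block i via footer -> index entry, and verify its checksum."""
    index_off, n = struct.unpack("<QQ", sst[-16:])
    entry = sst[index_off + 20 * i : index_off + 20 * (i + 1)]
    off, length, crc = struct.unpack("<QQI", entry)
    block = sst[off : off + length]
    if zlib.crc32(block) != crc:
        raise IOError(f"data block {i} is corrupt")
    return block

sst = build_sst([b"data-block-1", b"data-block-2"])
assert read_block(sst, 1) == b"data-block-2"
```

The per-block checksum is what lets the reader detect a corrupted data block; the next slides show that detection alone is not enough to know which key-value pairs were lost.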
ZippyDB-DIRECT
- 1. Protect and fix errors in local metadata
With local replication of metadata
- 2. Fix errors in data objects with replicas
Minimize the amount of data required from other replicas
Challenging in logically-replicated systems
- 3. Safe recovery
With respect to the system’s consistency guarantees
Identifying corrupt data

[Diagram: the SST layout with an X on one corrupted data block.]
There is no way of knowing the exact key-value pairs lost in the corrupted block!
Index blocks identify the corrupt range

[Diagram: the SST layout with an X on the corrupted data block; the surrounding index entries bound it with a key range.]
There is no way of knowing the exact key-value pair (e.g., <key123, value456>) that was lost. Instead, use the index entries to get a range, then fetch all keys in that range (e.g., [key1, key2]) from another replica.
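The range-identification step can be sketched as follows: if the index records the last key of each data block, the keys lost in block i must lie between the previous block's last key and block i's last key. The function names, key names, and the replica lookup below are hypothetical illustrations of this idea.

```python
def corrupt_key_range(index_last_keys, corrupt_block):
    """Bound the keys in a corrupted block using only index entries.

    index_last_keys[i] is the last key stored in data block i, as in an
    LSM index block; blocks hold disjoint sorted key ranges.
    """
    lo = index_last_keys[corrupt_block - 1] if corrupt_block > 0 else None
    hi = index_last_keys[corrupt_block]
    return lo, hi  # keys k with lo < k <= hi lived in the corrupt block

def recover_from_replica(replica_kv, lo, hi):
    # Fetch every key in the bounded range from a healthy replica.
    return {k: v for k, v in replica_kv.items()
            if (lo is None or k > lo) and k <= hi}

index = ["key099", "key199", "key299"]      # last key of blocks 0..2
lo, hi = corrupt_key_range(index, 1)        # block 1 is corrupt
replica = {"key050": "a", "key123": "value456", "key250": "b"}
patch = recover_from_replica(replica, lo, hi)
assert patch == {"key123": "value456"}
```

Only the keys inside the bounded range are fetched, which is what keeps DIRECT's recovery traffic small compared with copying an entire node.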
ZippyDB-DIRECT
- 1. Protect and fix errors in local metadata
With local replication of metadata
- 2. Fix errors in data objects with replicas
Minimize the amount of data required from other replicas
Challenging in logically-replicated systems
- 3. Safe recovery
With respect to the system’s consistency guarantees
Safe recovery of corrupted data objects

[Diagram: (1) a RocksDB instance detects a corrupted key range; (2) it reports the range to its ZippyDB server; (3) that server sends a patch request into the coordination layer; (4) patch requests are forwarded to the other replicas (primary and secondary); (5) each replica reads the range from its RocksDB and responds with a patch; (6) the recovering server applies the patch.]
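One way to see why routing recovery through the coordination layer is safe: if every patch is ordered like a normal write, a patched value can never clobber a newer write. The sketch below uses per-key sequence numbers as a stand-in for the replication log's order; the mechanism and names are illustrative, not ZippyDB's exact protocol.

```python
class Store:
    """A replica that tags every key with the log sequence number of the
    write that produced it."""

    def __init__(self):
        self.kv = {}  # key -> (seq, value)

    def write(self, seq, key, value):
        self.kv[key] = (seq, value)

    def apply_patch(self, patch):
        # Apply a recovered (seq, value) only if it is not older than
        # what we already have -- later writes always win.
        for key, (seq, value) in patch.items():
            if key not in self.kv or self.kv[key][0] <= seq:
                self.kv[key] = (seq, value)

healthy, recovering = Store(), Store()
healthy.write(1, "A", "v1")
recovering.write(1, "A", "v1")

# A bit error destroys A on the recovering replica; it asks for a patch.
del recovering.kv["A"]
patch = {"A": healthy.kv["A"]}

# A newer write (seq 2) arrives through the log before the patch does.
healthy.write(2, "A", "v2")
recovering.write(2, "A", "v2")

recovering.apply_patch(patch)          # the stale patch (seq 1) is ignored
assert recovering.kv["A"] == (2, "v2")
```

Contrast this with the naïve protocol earlier, where the stale recovered value was installed blindly and left the replicas inconsistent.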
ZippyDB-DIRECT Summary
- 1. Replicate metadata blocks in-line
- 2. Recover data blocks by recovering the corrupted key range
- 3. Run recovery through the coordination layer for safety
ZippyDB-DIRECT Evaluation
- 1. How much higher an error rate can ZippyDB/RocksDB tolerate with DIRECT?
- 2. How much faster is recovery from bit corruptions in ZippyDB/RocksDB with DIRECT?
Experimental Setup
- 60 Facebook servers
- Traffic duplicated from live servers
- Error injection done on the fly in RocksDB
Error rate tolerance (~2600 QPS)

UBER            10^-10   10^-11   10^-12    10^-13
ZippyDB         2.731%   1.981%   0.265%    0.011%
ZippyDB-DIRECT  0.187%   0.040%   0.0008%   0.0002%
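For intuition about the UBER axis: an uncorrectable bit error rate of 10^-12 means roughly one bad bit per 10^12 bits read. The back-of-the-envelope sketch below (our illustration, not the paper's methodology) estimates the chance that a large read hits at least one bad bit, assuming independent per-bit errors.

```python
def p_read_hits_error(uber, nbytes):
    """Probability that reading nbytes hits at least one uncorrectable
    bit error, assuming independent per-bit errors at rate `uber`."""
    bits = 8 * nbytes
    return 1.0 - (1.0 - uber) ** bits

# Reading 1 GB at each of the UBERs in the table above:
gb = 10**9
for uber in (1e-13, 1e-12, 1e-11, 1e-10):
    print(f"UBER {uber:.0e}: P(error in 1 GB read) = "
          f"{p_read_hits_error(uber, gb):.4f}")
```

Even at today's UBER targets, a busy server reading terabytes per day is all but guaranteed to see uncorrectable errors eventually, which is why tolerating them in software pays off.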
DIRECT recovers in milliseconds, while ZippyDB takes minutes
Conclusion
- With existing distributed redundancy, DIRECT enables storage systems to:
- Tolerate 100,000x higher bit error rates
- Reduce recovery time from bit errors by orders of magnitude
- Two novel mechanisms:
- Minimize the size of the recovered data
- Leverage the coordination layer for safe recovery
- Consequently, extend flash lifetime by allowing devices to expose bit errors