Understanding and Finding Crash-Consistency Bugs in Parallel File Systems
Jinghan Sun, Chen Wang, Jian Huang, and Marc Snir
University of Illinois at Urbana-Champaign Contact: Jinghan Sun (js39@illinois.edu)
PFS failures are frequent and
PFS Failure Frequency (bar chart; values as extracted: 8%, 34%, 34%, 12%; category labels as extracted: Weekly, Monthly, Never, Not Reported)
Single Day Failure Cost: <$100K 41%; $100K-$500K 14%; $500K-$1M 6%; >$1M 4%; Not Reported 35%
PFS Recovery Time: <1 day 59%; 2-3 days 24%; 1 week 14%; >1 week 3%
Source: Hyperion Research 2019
Workloads: Atomic Replace via Rename (ARVR); Write-ahead Logging (WAL)
Operations: create, delete, rename, resize, update
Figure: Number of Vulnerabilities on Different Filesystems (bar chart, axis ticks 1 to 10)
Workloads: ARVR, WAL, H5-create, H5-delete, H5-resize, H5-rename, H5-write
File systems: BeeGFS, OrangeFS, ext4
Figure: PFS nodes: storage #1, metadata, storage #2
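// atomic replace via rename (ARVR)
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

bool atomic_update(const void *new_data, size_t size) {
    int fd = creat("file.tmp", 0644);   // write the new contents to a temporary file
    if (fd < 0) return false;
    bool ok = write(fd, new_data, size) == (ssize_t)size;
    ok = (close(fd) == 0) && ok;
    return ok && rename("file.tmp", "file.txt") == 0;   // atomically replace the old file
}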
Figure: operations issued by beegfs-client for ARVR across storage #1, metadata, and storage #2
Operation labels (as extracted): creat chunk; append chunk; creat idfile; link idfile dentries/tmp; rename dentries/tmp dentries/file; unlink idfile_2; unlink
Figure (repeated over three slides): the same ARVR operation diagram across storage #1, metadata, and storage #2, with a legend distinguishing persisted operations from non-persisted operations
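The persisted/non-persisted split in the diagram suggests how crash states can be enumerated from a trace. The following is a minimal sketch, not the authors' tool: it assumes (this is not stated in the slides) that each server persists its own operations in order, so a crash state is any combination of per-server prefixes. The names `struct op`, `is_prefix_consistent`, and `count_crash_states` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_OPS = 16 };

/* One traced operation; `server` identifies the node that executed it.
   Hypothetical representation, chosen only for this illustration. */
struct op {
    int server;
    const char *name;
};

/* Returns 1 if `mask` (bit i set = op i persisted) respects per-server order:
   an op may be persisted only if every earlier op on the same server is too.
   ASSUMPTION: per-server prefix persistence; real systems may reorder more. */
static int is_prefix_consistent(const struct op *ops, size_t n, unsigned mask) {
    for (size_t i = 0; i < n; i++) {
        if (!(mask & (1u << i)))
            continue;
        for (size_t j = 0; j < i; j++) {
            if (ops[j].server == ops[i].server && !(mask & (1u << j)))
                return 0;
        }
    }
    return 1;
}

/* Counts the crash states allowed under the assumption above. */
static unsigned count_crash_states(const struct op *ops, size_t n) {
    assert(n <= MAX_OPS);
    unsigned count = 0;
    for (unsigned mask = 0; mask < (1u << n); mask++)
        count += (unsigned)is_prefix_consistent(ops, n, mask);
    return count;
}
```

With two operations on one server and three on another, this model admits (2+1) x (3+1) = 12 crash states, far fewer than the 32 unconstrained subsets.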
Figure: testing workflow
Stage labels (as extracted): Legal replay, Crash, Record, Test, Classification
Inputs: workload, consistency model
Traces: client-side traces, server-side traces
Crash states: pass through filesystem & app-level recovery
Legal states: file system images that satisfy the given consistency model
Checker: compares crash states with legal states; outcome passed or failed
Output: Report
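The checker step in the workflow can be sketched as a membership test: a recovered crash state passes if it matches some legal state. This is a minimal illustration only. It assumes a state can be summarized as a string (for example, the contents of file.txt after recovery), and the function name `check_crash_state` is hypothetical rather than taken from the authors' framework.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 ("passed") if the recovered crash state equals any legal state,
   and 0 ("failed") otherwise.
   ASSUMPTION: states are compared as plain strings; a real checker would
   compare whole file system images or application-level views. */
static int check_crash_state(const char *recovered,
                             const char *const *legal, size_t n_legal) {
    assert(recovered != NULL);
    for (size_t i = 0; i < n_legal; i++) {
        if (strcmp(recovered, legal[i]) == 0)
            return 1;
    }
    return 0;
}
```

For the ARVR workload, an illustrative set of legal states would hold the old and the new contents of the file; a recovered state with empty or partial contents would then be classified as failed.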