Certifying a Crash-safe File System (Nickolai Zeldovich)


SLIDE 1

Certifying a Crash-safe File System

Nickolai Zeldovich

Collaborators: Tej Chajed, Haogang Chen, Alex Konradi, Stephanie Wang, Daniel Ziegler, Adam Chlipala, M. Frans Kaashoek

SLIDE 2

File systems should not lose data

• People use file systems to store permanent data
• Computers can crash anytime:
  • power failures
  • hardware failures (unplugged USB drive)
  • software bugs
• File systems should not lose or corrupt data in case of crashes

SLIDE 3

File systems are complex and have bugs

• Linux ext4: ~60,000 lines of code
• Some bugs are serious: data loss, security exploits, etc.

[Chart] Cumulative number of bug patches in Linux file systems (ext3, xfs, jfs, reiserfs, ext4, btrfs), Dec 2003 to May 2011 [Lu et al., FAST'13]

SLIDE 4-6

Research on avoiding bugs in file systems

• Most research is on finding bugs, which reduces the # of bugs:
  • Crash injection (e.g., EXPLODE [OSDI'06])
  • Symbolic execution (e.g., EXE [Oakland'06])
  • Design modeling (e.g., in Alloy [ABZ'08])
• Some elimination of bugs by proving, but incomplete + no crashes:
  • FS without directories [Arkoudas et al. 2004]
  • BilbyFS [Keller 2014]
  • UBIFS [Ernst et al. 2013]
SLIDE 7

Dealing with crashes is hard

• Crashes expose many partially-updated states
• Reasoning about all failure cases is hard
• Performance optimizations lead to more tricky partial states
  • Disk I/O is expensive
  • Buffer updates in memory
SLIDE 8

Dealing with crashes is hard

commit 353b67d8ced4dc53281c88150ad295e24bc4b4c5
Author: Jan Kara <jack@suse.cz>
Date:   Sat Nov 26 00:35:39 2011 +0100
Title:  jbd: Issue cache flush after checkpointing

--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -504,7 +503,25 @@ int cleanup_journal_tail(journal_t *journal)
         spin_unlock(&journal->j_state_lock);
         return 1;
     }
+    spin_unlock(&journal->j_state_lock);
+
+    /*
+     * We need to make sure that any blocks that were recently written out
+     * --- perhaps by log_do_checkpoint() --- are flushed out before we
+     * drop the transactions from the journal. It's unlikely this will be
+     * necessary, especially with an appropriately sized journal, but we
+     * need this to guarantee correctness. Fortunately
+     * cleanup_journal_tail() doesn't get called all that often.
+     */
+    if (journal->j_flags & JFS_BARRIER)
+        blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
+    spin_lock(&journal->j_state_lock);
+    if (!tid_gt(first_tid, journal->j_tail_sequence)) {
+        spin_unlock(&journal->j_state_lock);
+        /* Someone else cleaned up journal so return 0 */
+        return 0;
+    }

A patch for Linux's write-ahead logging (jbd) in 2012 wrestles with "Is it safe to omit a disk write barrier here?": "It's unlikely this will be necessary, … but we need this to guarantee correctness. Fortunately this function doesn't get called all that often."

SLIDE 9

Goal: certify a file system under crashes

A complete file system with a machine-checkable proof that its implementation meets its specification, both under normal execution and under any sequence of crashes, including crashes during recovery.
SLIDE 10

Contributions

• CHL: Crash Hoare Logic
  • Specification framework for crash-safety of storage
  • Crash conditions and recovery semantics
  • Automation to reduce proof effort
• FSCQ: the first certified crash-safe file system
  • Basic Unix-like file system (no hard links, no concurrency)
  • Precise specification for the core subset of POSIX
  • I/O performance on par with Linux ext4
  • CPU overhead is high
SLIDE 11-15

FSCQ runs standard Unix programs

[Diagram] FSCQ (written in Coq) consists of programs and proofs against internal specifications and a top-level specification, all expressed in Crash Hoare Logic (CHL); the Coq proof checker validates the proofs ("OK"). Mechanical code extraction produces FSCQ's Haskell code, which the Haskell compiler builds together with FSCQ's FUSE server, Haskell libraries, and the FUSE driver. Standard programs ($ mv src dest, $ git clone repo…, $ make) issue syscalls; the Linux kernel forwards them as FUSE upcalls to FSCQ, which performs disk read(), write(), and sync() on /dev/sda.

SLIDE 16

FSCQ's Trusted Computing Base

[Diagram] The same pipeline, with the trusted components highlighted: the Coq proof checker, mechanical code extraction, the Haskell compiler and libraries, the FUSE driver, and the Linux kernel.

SLIDE 17

Outline

• Crash safety
  • What is the correct behavior after a crash?
• Challenge 1: formalizing crashes
  • Crash Hoare Logic (CHL)
• Challenge 2: incorporating performance optimizations
  • Disk sequences
• Building a complete file system
• Evaluation
SLIDE 18

What is crash safety?

• What guarantee should a file system provide when it crashes and reboots?
• Look it up in the POSIX standard?
SLIDE 19

POSIX is vague about crash behavior

• POSIX's goal was to specify "common-denominator" behavior
• Gives file systems freedom to implement their own optimizations

"[...] a power failure [...] can cause data to be lost. The data may be associated with a file that is still open, with one that has been closed, with a directory, or with any other internal system data structures associated with permanent storage. This data can be lost, in whole or part, so that only careful inspection of file contents could determine that an update did not occur."

IEEE Std 1003.1, 2013 Edition

SLIDE 20

What is crash safety?

• What guarantee should a file system provide when it crashes and reboots?
• Look it up in the POSIX standard? (Too vague)
• A simple and useful definition is transactional:
  • Atomicity: every file-system call is all-or-nothing
  • Durability: every call persists on disk when it returns
• Run every file-system call inside a transaction, using write-ahead logging.

SLIDE 21-26

Write-ahead logging

➡ log_begin()
➡ log_write(2, 'a')
➡ log_write(8, 'b')
➡ log_write(5, 'c')
➡ log_commit()

[Diagram] Log: entries 2→a, 8→b, 5→c plus a commit record ("3" entries); Disk: a, c, b written to their home locations.

1. Append writes to the log
2. Set commit record
3. Apply the log to disk locations
4. Truncate the log

• Recovery: after a crash, replay (apply) any committed transaction in the log
• Atomicity: either all writes appear on disk or none do
• Durability: all changes are persisted on disk when log_commit() returns
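
The protocol above is small enough to sketch in runnable form. The following Python is illustrative only, not FSCQ's code: the Disk class, the LOG_START/COMMIT_BLOCK layout, and the assumption that the log region is disjoint from data blocks are simplifications made up for this sketch.

LOG_START, COMMIT_BLOCK = 1, 0   # assumed layout: commit record in block 0

class Disk:
    # Toy synchronous disk: a dict of block -> value (no crash modeling).
    def __init__(self):
        self.blocks = {}
    def read(self, addr):
        return self.blocks.get(addr)
    def write(self, addr, val):
        self.blocks[addr] = val
    def sync(self):
        pass   # a real disk would flush its volatile write cache here

class Log:
    def __init__(self, disk):
        self.disk = disk
        self.pending = []                        # in-memory transaction

    def begin(self):
        self.pending = []

    def write(self, addr, val):
        self.pending.append((addr, val))

    def commit(self):
        # 1. Append writes to the log.
        for i, (addr, val) in enumerate(self.pending):
            self.disk.write(LOG_START + i, (addr, val))
        self.disk.sync()            # entries durable before commit record
        # 2. Set commit record (a single atomic block write).
        self.disk.write(COMMIT_BLOCK, len(self.pending))
        self.disk.sync()            # the transaction is now durable
        # 3. Apply the log to the home disk locations.
        for addr, val in self.pending:
            self.disk.write(addr, val)
        self.disk.sync()
        # 4. Truncate the log.
        self.disk.write(COMMIT_BLOCK, 0)
        self.disk.sync()

    def recover(self):
        # Replay (apply) any committed transaction found in the log.
        n = self.disk.read(COMMIT_BLOCK) or 0
        for i in range(n):
            addr, val = self.disk.read(LOG_START + i)
            self.disk.write(addr, val)
        self.disk.sync()
        self.disk.write(COMMIT_BLOCK, 0)         # then truncate
        self.disk.sync()

Atomicity hinges on step 2 being a single block write: if the crash hits before the commit record reaches disk, recovery finds n = 0 and replays nothing; after it, recovery replays the whole transaction.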
SLIDE 27

Example: transactional crash safety

• Q: How to formally define what happens when the computer crashes?
• Q: How to formally specify the behavior of create() in the presence of crashes and recovery?

def create(dir, name):
    log_begin()
    newfile = allocate_inode()
    newfile.init()
    dir.add(name, newfile)
    log_commit()

… after crash …

def log_recover():
    if committed:
        log_apply()
    log_truncate()

SLIDE 28-29

Approach: Crash Hoare Logic

{pre} code {post} {crash}

• Crash condition: all intermediate disk states (plus the two end-states)
• CHL's disk model matches what most other file systems assume:
  • Writing a single block is an atomic operation; no data corruption

SPEC   disk_write(a, v)
PRE    a ⟼ v0
POST   a ⟼ v
CRASH  a ⟼ v0 ∨ a ⟼ v

SLIDE 30-33

Asynchronous disk I/O

• For performance, the hard drive caches writes in its internal volatile buffer
  • Writes do not persist immediately
• The disk flushes the buffer to media in the background
  • Writes might be reordered
• Use a write barrier (disk_sync) to force flushing the buffer
  • Makes data persistent & enforces ordering: log contents are persistent before the commit record
• Disk syncs are expensive!
SLIDE 34-35

Formalizing asynchronous disk I/O

• Challenge: on a crash, the disk might lose some of the recent writes

a ⟼ 0, b ⟼ 0
disk_write(a, 1)
disk_write(b, 2)
disk_write(a, 3)

Q: What are the possible disk states if crashing after the 3 writes?
A: 6 cases: a ⟼ 0 or 1 or 3, b ⟼ 0 or 2

• Idea: use value-sets, a ⟼ ⟨v0, vs⟩:
  • Read returns the latest value: v0
  • Write adds a value to the set: a ⟼ ⟨v, {v0} ∪ vs⟩
  • Sync discards previous values: a ⟼ ⟨v0, ∅⟩
  • Reboot chooses a random value: a ⟼ ⟨v′, ∅⟩, where v′ ∈ {v0} ∪ vs
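
This value-set semantics can be simulated directly; the ValueSetDisk class below is a made-up illustration of the model, not CHL itself.

import random

class ValueSetDisk:
    # Each address maps to (latest value, set of older unsynced values).
    def __init__(self, contents):
        self.state = {a: (v, set()) for a, v in contents.items()}

    def read(self, addr):
        return self.state[addr][0]          # read returns the latest value

    def write(self, addr, v):
        latest, vs = self.state[addr]
        self.state[addr] = (v, {latest} | vs)   # old latest joins the set

    def sync(self):
        # Discard previous values: each address keeps only its latest value.
        self.state = {a: (v, set()) for a, (v, _) in self.state.items()}

    def reboot(self):
        # After a crash, each address keeps an arbitrary buffered value.
        self.state = {a: (random.choice(list({v} | vs)), set())
                      for a, (v, vs) in self.state.items()}

d = ValueSetDisk({'a': 0, 'b': 0})
d.write('a', 1); d.write('b', 2); d.write('a', 3)
d.reboot()
print(d.read('a'), d.read('b'))   # a in {0, 1, 3}, b in {0, 2}: the 6 cases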

SLIDE 36

CHL asynchronous disk model

• Specifications for disk_write, disk_read, and disk_sync are axioms
• "disk ⊨ …" means the disk address space entails the predicate

SPEC   disk_write(a, v)
PRE    disk ⊨ a ⟼ ⟨v0, vs⟩
POST   disk ⊨ a ⟼ ⟨v, {v0} ∪ vs⟩
CRASH  disk ⊨ a ⟼ ⟨v0, vs⟩ ∨ a ⟼ ⟨v, {v0} ∪ vs⟩

SLIDE 37-41

Abstraction layers

• Each abstraction layer forms an address space:
  • Physical disk: a ⟼ ⟨v0, vs⟩
  • Logical disk (exposed by the log): a ⟼ v
  • Files: inum ⟼ file (file0, file1, file2, …, filen)
  • Directory tree
• Representation invariants connect logical states between layers: log_rep (physical disk ↔ logical disk), files_rep (logical disk ↔ files), dir_rep (files ↔ directory tree)

SLIDE 42-43

Example: representation invariant

SPEC   log_write(a, v)
PRE    disk ⊨ log_rep(ActiveTxn, start_state, old_state)
       old_state ⊨ a ⟼ v0
POST   disk ⊨ log_rep(ActiveTxn, start_state, new_state)
       new_state ⊨ a ⟼ v
CRASH  disk ⊨ log_rep(ActiveTxn, start_state, any_state)

• old_state and new_state are "logical disks" exposed by the logging system
• log_rep connects transaction state to an on-disk representation
  • Describes the log's on-disk layout using many ⟼ primitives

SLIDE 44

Certifying procedures

• bmap: return the block address at a given offset for an inode

def bmap(inode, bnum):
    if bnum >= NDIRECT:
        indirect = log_read(inode.blocks[NDIRECT])
        return indirect[bnum - NDIRECT]
    else:
        return inode.blocks[bnum]

SLIDE 45-49

Certifying procedures

• Annotate bmap with PRE/POST/CRASH conditions
• Follow the control flow graph
• Need pre/post/crash conditions for each called procedure (e.g., log_read)
• Chain pre- and postconditions, forming proof obligations
• CHL: also combines crash conditions, yielding more proof obligations

[Diagram] Control-flow graph of procedure bmap(): PRE → if → (log_read → return | return), with POST at each exit and CRASH conditions along the way.

SLIDE 50-52

Proof automation

• CHL follows the CFG and generates proof obligations
• CHL solves trivial obligations automatically (the common case)
• Remaining proof effort: changing representation invariants
  • Show that the rep invariant (e.g., inodes_rep) holds at entry and exit
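
As a toy illustration of the chaining (not CHL's implementation), one can represent each statement as a (pre, post, crash) triple of predicate strings and generate obligations mechanically:

def chain(steps):
    # steps: list of (pre, post, crash) for each statement, in order.
    # Chaining: each statement's postcondition must imply the next
    # statement's precondition; these implications are proof obligations.
    obligations = [(post, next_pre)
                   for (_, post, _), (next_pre, _, _) in zip(steps, steps[1:])]
    # CHL addition: the program's crash condition must cover every
    # statement's crash condition, so each statement contributes one more.
    crash_condition = " ∨ ".join(crash for (_, _, crash) in steps)
    return obligations, crash_condition

obls, crash = chain([
    ("a ⟼ v0", "a ⟼ v", "a ⟼ v0 ∨ a ⟼ v"),   # disk_write(a, v)
    ("a ⟼ v",  "b ⟼ v", "b ⟼ v0 ∨ b ⟼ v"),   # e.g., copying a into b
])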

SLIDE 53-57

Specifying an entire system call (simplified)

SPEC   create(dnum, fn)
PRE    disk ⊨ log_rep(NoTxn, start_state)
       start_state ⊨ dir_rep(tree) ∧
         ∃ path, tree[path].node = dnum ∧ fn ∉ tree[path]
POST   disk ⊨ log_rep(NoTxn, new_state)
       new_state ⊨ dir_rep(new_tree) ∧
         new_tree = tree.update(path, fn, EmptyFile)
CRASH  disk ⊨ log_rep(NoTxn, start_state)
         ∨ log_rep(NoTxn, new_state)
         ∨ log_rep(ActiveTxn, start_state, any_state)
         ∨ log_rep(CommittingTxn, start_state, new_state)

• The CRASH disjunction is abbreviated as would_recover_either(start_state, new_state), so the crash condition becomes:

CRASH  disk ⊨ would_recover_either(start_state, new_state)

SLIDE 58

Specifying log recovery

• log_recover() is idempotent:
  • Its crash condition implies its own precondition
  • OK to run log_recover() again after a crash during recovery

SPEC   log_recover()
PRE    disk ⊨ would_recover_either(last_state, committed_state)
POST   disk ⊨ log_rep(NoTxn, last_state) ∨ log_rep(NoTxn, committed_state)
CRASH  disk ⊨ would_recover_either(last_state, committed_state)
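
Operationally, idempotence means that running recovery once or several times leaves the same disk state. In terms of the hypothetical Log/Disk sketch from the write-ahead logging slides:

def recover_is_idempotent(disk):
    Log(disk).recover()
    snapshot = dict(disk.blocks)
    Log(disk).recover()              # a second recovery changes nothing
    return dict(disk.blocks) == snapshot

This is why a crash during log_recover() is harmless: the crash condition re-establishes the precondition, so rebooting into log_recover() again is always safe.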

SLIDE 59-62

Recovery execution semantics

• Joint execution of two procedures: bmap ⨝ log_recover
• Whenever bmap (or log_recover) crashes, run log_recover after reboot

[Diagram] bmap()'s control-flow graph with PRE/POST/CRASH; every CRASH edge leads into log_recover, whose outcome is the RECOVER condition.
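
The joint execution can be read as a control loop; this sketch uses a hypothetical CrashError exception as a stand-in for a machine crash (which real hardware does not deliver as an exception):

class CrashError(Exception):
    pass

def run_with_recovery(procedure, log):
    # Run `procedure` to completion (POST holds), or, if it crashes,
    # reboot into log.recover(); if recovery itself crashes, rerun
    # recovery (it is idempotent), ending in the RECOVER condition.
    try:
        return procedure()           # normal exit: POST holds
    except CrashError:
        while True:
            try:
                log.recover()
                return None          # recovered exit: RECOVER holds
            except CrashError:
                pass                 # crash during recovery: recover again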

SLIDE 63

End-to-end specification

• create() is atomic, if log_recover() runs after every crash
• POST is stronger than RECOVER

SPEC   create(dnum, fn) ⨝ log_recover()
PRE    disk ⊨ log_rep(NoTxn, start_state)
       start_state ⊨ dir_rep(tree) ∧
         ∃ path, tree[path].node = dnum ∧ fn ∉ tree[path]
POST   disk ⊨ log_rep(NoTxn, new_state)
       new_state ⊨ dir_rep(new_tree) ∧
         new_tree = tree.update(path, fn, EmptyFile)
RECOVER disk ⊨ log_rep(NoTxn, start_state) ∨ log_rep(NoTxn, new_state)

SLIDE 64

CHL summary

• Key ideas: crash conditions and recovery semantics
• CHL benefit: enables precise failure specifications
  • Allows automatic chaining of pre/post/crash conditions
  • Reduces proof burden
• CHL cost: must write a crash condition for every function, loop, etc.
  • Crash conditions are often simple (above the logging layer)
SLIDE 65

Outline

• Crash safety
  • What is the correct behavior after a crash?
• Challenge 1: formalizing crashes
  • Crash Hoare Logic (CHL)
• Challenge 2: incorporating performance optimizations
  • Disk sequences
• Building a complete file system
• Evaluation

SLIDE 66

FSCQ implements many optimizations

• Group commit
  • Buffer transactions in memory, and flush them in a single batch
  • Relaxes the durability guarantee
• Log-bypass writes
  • File data writes go to the disk (buffer cache) directly
• Log checksums
  • Checksum log entries to reduce write barriers
• Deferred apply
  • Apply the log only when the log is full
SLIDE 67-70

Example: group commit

➡ mkdir('d')
➡ create('d/a')
➡ rename('d/a', 'd/b')
➡ fsync('d')

[Diagram] transaction cache (in memory) → log → data (on disk)

1. Each file-system call forms a transaction, which is buffered in the transaction cache
2. fsync() flushes cached transactions to the on-disk log in a batch
  • Preserves order
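
A sketch of this structure, reusing the hypothetical Log class from the write-ahead logging sketch (illustrative, not FSCQ's code):

class GroupCommitLog:
    def __init__(self, log):
        self.log = log               # underlying write-ahead log
        self.txn_cache = []          # transactions committed only in memory

    def commit_txn(self, writes):
        # Relaxed durability: the call returns before anything is on disk.
        self.txn_cache.append(list(writes))

    def fsync(self):
        # Flush cached transactions to the on-disk log, oldest first.
        for txn in self.txn_cache:
            self.log.begin()
            for addr, val in txn:
                self.log.write(addr, val)
            self.log.commit()
        self.txn_cache = []

A real implementation flushes the batch with fewer barriers; the sketch commits each cached transaction in order, which preserves the crash behavior that matters here: a crash exposes some prefix of the transaction sequence, never a reordering.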
SLIDE 71

Challenge: formalizing group commit

➡ mkdir('d')
➡ create('d/a')

• Many more crash states (e.g., before or after mkdir())
• The on-disk state can be unrelated to create() itself, reflecting some previous operations instead
• The earlier create() spec no longer fits: its crash condition, disk ⊨ would_recover_either(start_state, new_state), mentions only create()'s own start and new states

SLIDE 72-75

Specification idea: disk sequences

• Each (cached) system call adds a new logical disk to the sequence
• Each logical disk has a corresponding tree (tree_rep)
• Captures the idea that metadata updates must be ordered

[Diagram] Disk sequence: disk0 is the flushed state (write-ahead log); each in-memory transaction txn1 … txnn yields a further logical disk disk1 … diskn; diskn is the latest; each logical disk satisfies tree_rep.
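
The disk-sequence abstraction itself fits in a few lines of Python; the class below is a model of the specification idea with invented names, not FSCQ code:

class DiskSequence:
    # disks[0] is the flushed on-disk state; each later disk adds one
    # cached (in-memory) transaction on top of the previous one.
    def __init__(self, flushed):
        self.disks = [dict(flushed)]

    @property
    def latest(self):
        return self.disks[-1]

    def commit_txn(self, writes):
        # Each cached system call appends a new logical disk.
        new_disk = dict(self.latest)
        new_disk.update(writes)
        self.disks.append(new_disk)

    def fsync(self):
        # After fsync, only the latest logical disk remains possible.
        self.disks = [self.latest]

    def crash_states(self):
        # A crash may recover to any logical disk in the sequence:
        # updates are never reordered, only truncated to a prefix.
        return list(self.disks)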

SLIDE 76

New specification with disk sequences

• Disk sequences allow for simple specifications

SPEC   create(dnum, fn)
PRE    disk ⊨ log_rep(NoTxn, disk_seq)
       disk_seq.latest ⊨ dir_rep(tree) ∧
         ∃ path, tree[path].node = dnum ∧ fn ∉ tree[path]
POST   disk ⊨ log_rep(NoTxn, disk_seq ++ {new_state})
       new_state ⊨ dir_rep(new_tree) ∧
         new_tree = tree.update(path, fn, EmptyFile)
CRASH  disk ⊨ would_recover_any(disk_seq ++ {new_state})

SLIDE 77

Specification for fsync on directories

• After fsync(), there is only one possible on-disk state (the latest one)

SPEC   fsync(dir_inum)
PRE    disk ⊨ log_rep(NoTxn, disk_seq)
       disk_seq.latest ⊨ tree_rep(tree) ∧ IsDir(find_inum(tree, dir_inum))
POST   disk ⊨ log_rep(NoTxn, {disk_seq.latest})
CRASH  disk ⊨ would_recover_any(disk_seq)

SLIDE 78

Formalization techniques for optimizations

• Group commit
  • Disk sequences: capture ordered metadata updates
• Log-bypass writes
  • Disk relations: enforce safety w.r.t. metadata updates
• Log checksums
  • Checksum model: soundly reasons about hash collisions
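
The log-checksum idea can be sketched as follows, reusing the assumed LOG_START/COMMIT_BLOCK layout from the earlier logging sketch. hashlib's SHA-256 stands in for whatever checksum the real design uses, and the actual proof must still reason soundly about the (unlikely) possibility of hash collisions:

import hashlib

def checksum(entries):
    return hashlib.sha256(repr(entries).encode()).hexdigest()

def commit_with_checksum(disk, entries):
    for i, entry in enumerate(entries):
        disk.write(LOG_START + i, entry)
    # The commit record carries a checksum of the entries, so no write
    # barrier is needed between the entries and the record.
    disk.write(COMMIT_BLOCK, (len(entries), checksum(entries)))
    disk.sync()                      # a single barrier instead of two

def find_committed(disk):
    record = disk.read(COMMIT_BLOCK)
    if not record:
        return []
    n, digest = record
    entries = [disk.read(LOG_START + i) for i in range(n)]
    # If the crash hit before all entries (or the record) were durable,
    # the checksum will not match and recovery treats the log as empty.
    return entries if checksum(entries) == digest else []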

SLIDE 79

Outline

• Crash safety
  • What is the correct behavior after a crash?
• Challenge 1: formalizing crashes
  • Crash Hoare Logic (CHL)
• Challenge 2: incorporating performance optimizations
  • Disk sequences
• Building a complete file system
• Evaluation
SLIDE 80-84

FSCQ: building a complete file system

• File system design is close to v6 Unix (+ logging)
• Implementation aims to reduce proof effort
• Many precise internal abstraction layers
  • e.g., split File and Inode
• Reuse proven components
  • e.g., general bitmap allocator
• Simpler specifications
  • e.g., no hard links ⇒ tree spec

[Diagram] FSCQ layers, top to bottom: FSCQ system calls, Directory tree, Directory, Block-level file, Inode, Bitmap allocator, Buffer cache, Write-ahead log

SLIDE 85

Evaluation

• What bugs do FSCQ's theorems eliminate?
• How much development effort is required for FSCQ?
• How well does FSCQ perform?
SLIDE 86

Does FSCQ eliminate bugs?

• One data point: once the theorems were proven, no implementation bugs were found in the proven code
  • Did find some mistakes in specs, as a result of end-to-end checks
  • E.g., forgot to specify that extending a file should zero-fill
• Systematic study:
  • Categorize bugs from the Linux kernel's patch history
  • Manually examine whether FSCQ can eliminate the bugs in each category
SLIDE 87-89

FSCQ's theorems eliminate many bugs

Bug category                                              Prevented?
Mistakes in logging logic                                 Yes
  (e.g., combining incompatible optimizations)
Misuse of logging API                                     Yes
  (e.g., releasing an indirect block in two transactions)
Mistakes in recovery protocol                             Yes
  (e.g., issuing a write barrier in the wrong order)
Improper corner-case handling                             Yes
  (e.g., running out of blocks during rename)
Low-level bugs                                            Some (memory safe)
  (e.g., double free, integer overflow)
Returning incorrect error code                            Some
Concurrency                                               Not supported
Security                                                  Not supported

SLIDE 90

Development effort

• Total of ~50,000 lines of verified code, specs, and proofs in Coq
  • ~3,500 lines of implementation; the rest is specs, lemmas, and proofs
  • > 50% is reusable infrastructure
• Comparison: ext4 has ~60,000 lines of C code (with many more features)
• What's the cost of adding new features to FSCQ?

[Pie chart] Breakdown of the Coq code (4%, 12%, 7%, 5%, 21%, 8%, 44%) across: CHL infrastructure, general data structures, write-ahead log, buffer cache, inodes and files, directories, top-level API

SLIDE 91-94

Change effort proportional to scope of change

• Indirect blocks:
  • +1,500 lines in Inode
• Write-back buffer cache:
  • +2,300 lines beneath the log
  • ~600 lines in the rest of FSCQ
• Group commit:
  • +1,800 lines in Log
  • ~100 lines in the rest of FSCQ
• Changed lines include code, specs, and proofs

SLIDE 95

Performance comparison

• File-system-intensive workloads
  • LFS "largefile" benchmark
  • mailbench, a qmail-like mail server
• Compare with ext4 (non-certified) in default mode
  • Mount options: async, data=ordered
  • Use FUSE to forward and serialize requests (disables concurrency)
• Running on a hard disk on a desktop
  • Quad-core Intel i7-980X 3.33 GHz / 24 GB RAM / Hitachi HDS721010CLA332
  • Linux 3.11 / GHC 8.0.1 / all file systems run on a separate partition
SLIDE 96

FSCQ performance

• FSCQ's CPU overhead is high
• FSCQ's I/O performance is on par with ext4

[Chart] Running time in seconds (0 to 25) on largefile and mailbench, FSCQ vs. ext4

Number of disk I/Os per operation:
        largefile           mailbench
        write     sync      write     sync
FSCQ    1,550     1,290     42.98     13.8
ext4    1,554     1,290     40.40     12.3

SLIDE 97

Future directions

• Extracting to native code
  • Reduce both CPU overhead and the TCB
• Certifying crash-safe applications
  • Use FSCQ's top-level spec to certify a mail server or a KV store
• Supporting concurrency
  • Run FSCQ in a multi-user environment
  • Exploit both I/O concurrency and parallelism
SLIDE 98

Conclusion

• CHL helps specify and prove crash safety
  • Crash conditions
  • Recovery execution semantics
• FSCQ: the first certified crash-safe file system
  • Precise specification in the presence of crashes
  • I/O performance on par with Linux ext4
  • Moderate development effort

https://github.com/mit-pdos/fscq-impl