
SLIDE 1

Consistency Without Ordering

Vijay Chidambaram, Tushar Sharma, Andrea Arpaci‐Dusseau, Remzi Arpaci‐Dusseau

The Advanced Systems Laboratory

University of Wisconsin Madison

SLIDE 2

The problem: crash consistency

Single operation updates multiple blocks
System might crash in the middle of operation

– Some blocks updated, some blocks not updated

After crash, file system needs to be repaired

– In order to restore consistency among blocks

FAST 12 2 2/15/12

SLIDE 3

Solution #1: Lazy, optimistic approach

Write blocks to disk in any order

– Fix inconsistencies upon reboot

Advantage: Simple, High performance
Disadvantage: Expensive recovery
Example: ext2 with fsck [Card94]


SLIDE 4

Solution #2: Eager, pessimistic approach

Carefully order writes to disk
Advantage: Quick recovery
Disadvantage: Perpetual performance penalty
Examples

– Soft updates (FFS) [Ganger94]
– Journaling (CFS) [Hagmann87]
– Copy‐on‐write (ZFS) [Bonwick04]


SLIDE 5

Ordering points considered harmful

Reduce performance

– Constrain scheduling of disk writes

Increase complexity
Require lower‐level primitives

– IDE/SATA cache flush commands


SLIDE 6

Ordering points require trust

File system runs on stack of virtual devices

– Consistency fails if any device ignores commands to flush cache


F_FULLFSYNC: “…The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data.”

VirtualBox: “If desired, the virtual disk images can be flushed when the guest issues the IDE FLUSH CACHE command. Normally these requests are ignored for improved performance”

SLIDE 7

Is crash‐consistency possible without ordering points?

Middle ground between lazy and eager approaches
Simplicity and high performance of lazy approach
Strong consistency and availability of eager approach


SLIDE 8

Our solution: No‐Order File System (NoFS)

Order‐less file system which uses mutual agreement between objects to obtain consistency


SLIDE 9

Results

Designed a new crash‐consistency technique

– Backpointer‐based consistency (BBC)

Theoretically and experimentally verified that NoFS provides strong consistency

Evaluated NoFS against ext2 and ext3

– NoFS performance comparable to ext2
– NoFS performance equal to or better than ext3


SLIDE 10

Outline

Introduction
Crash‐consistency and Object identity
The No‐Order File System
Results
Conclusion


SLIDE 11

Crash consistency and object identity

All file system inconsistencies are due to ambiguity about the logical identity of an object

Logical identity of an object

– Data block: Owner file, offset
– File: Parent directories

Common inconsistencies

– Two files claim the same data block
– File points to garbage data

SLIDE 12

Crash Scenario

Actions:

– File A is truncated
– The freed data block is allocated to File B
– The updated data blocks are written to disk

Problem: Due to a crash, File A is not updated on disk
Result: On disk, both files claim the data block

[Figure: File A, File B, and the data block, as seen in memory and on disk]

SLIDE 13

Outline

Introduction
Crash‐consistency and Object identity
The No‐Order File System

– Backpointer‐based consistency (BBC)
– Non‐persistent allocation structures

Results
Conclusion


SLIDE 14

Backpointer‐based consistency (BBC)

Associate object with its logical identity

– Embed backpointer into each object
– Owner(s) of the object found through backpointer

Consistency obtained through mutual agreement
Key Assumption

– Object and backpointer written atomically

[Figure: File A and a data block]

SLIDE 15

Using backpointers in a crash scenario

[Figure: File A, File B, and the data block, as seen in memory and on disk]

Actions:

– File A is truncated
– The freed data block is allocated to File B
– The updated data blocks are written to disk

Problem: Due to a crash, File A is not updated on disk
Result: Using the backpointer, the true owner is identified
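The crash scenario above can be replayed in a few lines. This is a minimal sketch and not the NoFS code: the `Inode`, `DataBlock`, and `owns` names, the inode numbers, and the block number 7 are all hypothetical, and the backpointer is assumed to be written atomically with the block, as the key assumption requires.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DataBlock:
    # Backpointer stored inside the block: (owner inode number, offset in file).
    backpointer: Optional[Tuple[int, int]] = None
    payload: bytes = b""


@dataclass
class Inode:
    number: int
    # Forward pointers: block_pointers[offset] is a block number.
    block_pointers: List[int] = field(default_factory=list)


def owns(inode: Inode, offset: int, disk_blocks: Dict[int, DataBlock]) -> bool:
    """Return True only if the forward pointer and the backpointer agree."""
    if offset >= len(inode.block_pointers):
        return False
    block = disk_blocks.get(inode.block_pointers[offset])
    return block is not None and block.backpointer == (inode.number, offset)


# On-disk state after the crash. File A (inode 1) was truncated in memory,
# but its stale inode, still pointing at block 7, is what survived on disk.
file_a = Inode(number=1, block_pointers=[7])
file_b = Inode(number=2, block_pointers=[7])
# Block 7 was rewritten for File B together with its new backpointer.
disk = {7: DataBlock(backpointer=(2, 0), payload=b"data of B")}

print(owns(file_a, 0, disk))  # False: File A's claim is rejected
print(owns(file_b, 0, disk))  # True: File B is the true owner
```

Both inodes still point at the block, but only one pointer is matched by the block's own backpointer, so the ambiguity is resolved without any write ordering.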

SLIDE 16

Backpointers of different objects

Data blocks have a single backpointer to file
Files can have many backpointers

– One for each parent directory

Detection of inconsistencies

– Each access of an object involves checking its backpointer

[Figure: a file, its data block, and two directories]

SLIDE 17

Formal Model of BBC

Extended a formal model for file systems with backpointers [Sivathanu05]

Defined the level of consistency provided by BBC

– Data consistency

Proved that a file system with backpointers provides data consistency

SLIDE 18

Outline

Introduction
Crash‐consistency and Object identity
The No‐Order File System

– Backpointer‐based consistency
– Non‐persistent allocation structures

Results
Conclusion


SLIDE 19

Allocation structures

File systems need to track allocation status
Crash may leave such structures inconsistent
True allocation status needs to be found

[Figure: data block bitmap, File A, and data block 1, as seen in memory and on disk]

SLIDE 20

Allocation structures

After a crash, true allocation status of all objects must be found
Traditional file systems do this proactively

– File‐system check scans disk to get status
– Journaling file systems write to a log to avoid scan

SLIDE 21

Non‐persistent allocation structures

NoFS does not persist allocation structures
Why?

– Cannot be trusted after crash, need to be verified
– Complicate update protocol

SLIDE 22

Non‐persistent allocation structures

How is allocation information tracked then?

– Need to know which metadata/data blocks are free

Move the work of finding allocation information to the background

– Creation of new objects can proceed without complete allocation information

SLIDE 23

Non‐persistent allocation structures

Backpointers used to determine allocation

– Object in use if pointers mutually agree
– Check each object individually
– Use validity bitmaps to track checked objects

Allocation structures built up incrementally
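One way to picture incremental allocation discovery is the sketch below. It is illustrative only and makes several assumptions: `AllocationTracker`, `check`, and `find_free` are hypothetical names, the bitmaps are plain Python lists, and "in use" is decided purely by whether the forward pointer and the backpointer agree.

```python
from typing import Dict, List, Optional, Tuple

Backpointer = Tuple[int, int]  # (owner inode number, offset within the file)


class AllocationTracker:
    """In-memory allocation state, rebuilt one block at a time after mount."""

    def __init__(self, inodes: Dict[int, List[int]],
                 backpointers: Dict[int, Optional[Backpointer]],
                 num_blocks: int) -> None:
        self.inodes = inodes              # inode number -> forward pointers
        self.backpointers = backpointers  # block number -> backpointer
        self.in_use = [False] * num_blocks   # allocation bitmap
        self.checked = [False] * num_blocks  # validity bitmap

    def check(self, block_no: int) -> bool:
        """Verify one block and record the result in both bitmaps."""
        bp = self.backpointers.get(block_no)
        agree = False
        if bp is not None:
            owner, offset = bp
            pointers = self.inodes.get(owner, [])
            agree = offset < len(pointers) and pointers[offset] == block_no
        self.in_use[block_no] = agree
        self.checked[block_no] = True
        return agree

    def find_free(self) -> Optional[int]:
        """Return a block already known to be free, if any has been found."""
        for block_no, done in enumerate(self.checked):
            if done and not self.in_use[block_no]:
                return block_no
        return None


inodes = {1: [0, 2]}
backpointers = {0: (1, 0), 1: None, 2: (9, 0), 3: (1, 1)}
tracker = AllocationTracker(inodes, backpointers, num_blocks=4)
tracker.check(0)
tracker.check(1)
print(tracker.find_free())  # 1, found before the remaining blocks are checked
print(tracker.checked)      # [True, True, False, False]
```

The second bitmap is what lets allocation proceed early: a block is handed out only once it has been checked and found free, regardless of how much of the disk is still unverified.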

SLIDE 24

Determining allocation information

[Figure: ext2 versus NoFS. ext2 shows a data block bitmap over the data blocks of Files A, B, C, and D; NoFS shows a data block bitmap and a data block validity bitmap over the same data blocks]

SLIDE 25

Background Scan

Complete allocation information not needed
Allocation information discovered using two background threads

– One for metadata
– One for data

Scheduling of scan can be configured

– Run when idle
– Run periodically
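The two-thread arrangement might look roughly like the following sketch. It is not the kernel implementation: `scan` and `start_background_scan` are hypothetical helpers, the verification functions are stubs passed in by the caller, and the `pause` parameter is only a stand-in for the configurable scheduling.

```python
import threading
import time
from typing import Callable, Dict, Iterable, List, Tuple


def scan(items: Iterable[int], verify: Callable[[int], bool],
         results: Dict[int, bool], lock: threading.Lock,
         pause: float = 0.0) -> None:
    """Verify each object in turn and record whether it is in use."""
    for item in items:
        ok = verify(item)  # backpointer check, supplied by the caller
        with lock:
            results[item] = ok
        if pause > 0:
            time.sleep(pause)  # yield to foreground I/O between objects


def start_background_scan(
    inode_nums: Iterable[int], block_nums: Iterable[int],
    verify_inode: Callable[[int], bool], verify_block: Callable[[int], bool],
    pause: float = 0.0,
) -> Tuple[List[threading.Thread], Dict[int, bool], Dict[int, bool]]:
    """Start one thread for metadata and one for data, then return at once."""
    lock = threading.Lock()
    inode_results: Dict[int, bool] = {}
    block_results: Dict[int, bool] = {}
    threads = [
        threading.Thread(target=scan, daemon=True,
                         args=(inode_nums, verify_inode, inode_results, lock, pause)),
        threading.Thread(target=scan, daemon=True,
                         args=(block_nums, verify_block, block_results, lock, pause)),
    ]
    for thread in threads:
        thread.start()
    return threads, inode_results, block_results


threads, inode_ok, block_ok = start_background_scan(
    range(4), range(8),
    verify_inode=lambda n: n != 3,      # stub: pretend inode 3 is an orphan
    verify_block=lambda n: n % 2 == 0,  # stub: pretend odd blocks are free
    pause=0.001,
)
for thread in threads:
    thread.join()
print(sorted(inode_ok.items()))
```

Because `start_background_scan` returns immediately, foreground work can continue while the result dictionaries fill in.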

SLIDE 26

Design

[Figure: memory and disk contents, showing files, data blocks, directories, inode bitmaps, data block bitmaps, group descriptors, and the inode and data block validity bitmaps]

SLIDE 27

Implementation

Based on ext2 codebase
Three types of backpointers

– Data block backpointers {inode num, offset}
– Inode backlinks {inode num}
– Directory block backpointers {dot directory entry}

Inode size increased to support 32 backlinks
Modified the Linux page cache to add checks
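The three backpointer types can be modeled as plain records, as in the sketch below. Only the field contents and the limit of 32 backlinks come from the slide; the class names, the `add_backlink` and `entry_is_valid` helpers, and the choice of `OSError` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_BACKLINKS = 32  # from the slide: inode enlarged to hold 32 backlinks


@dataclass(frozen=True)
class DataBlockBackpointer:
    inode_num: int  # owning file
    offset: int     # position of the block within that file


@dataclass
class Inode:
    inode_num: int
    # One backlink per parent directory, stored as that directory's inode number.
    backlinks: List[int] = field(default_factory=list)

    def add_backlink(self, parent_inode_num: int) -> None:
        if len(self.backlinks) >= MAX_BACKLINKS:
            raise OSError("inode has no room for another backlink")
        self.backlinks.append(parent_inode_num)


@dataclass
class DirectoryBlock:
    dot_inode_num: int       # the "." entry identifies the owning directory
    entries: Dict[str, int]  # file name -> inode number


def entry_is_valid(block: DirectoryBlock, name: str,
                   inodes: Dict[int, Inode]) -> bool:
    """A directory entry counts only if the target inode links back."""
    target = inodes.get(block.entries.get(name))
    return target is not None and block.dot_inode_num in target.backlinks


notes = Inode(inode_num=10)
notes.add_backlink(2)
stale = Inode(inode_num=11)  # never linked back to directory 2
inodes = {10: notes, 11: stale}
directory = DirectoryBlock(dot_inode_num=2, entries={"notes": 10, "old": 11})

print(entry_is_valid(directory, "notes", inodes))  # True
print(entry_is_valid(directory, "old", inodes))    # False
```

The fixed backlink limit is visible in `add_backlink`: once an inode has 32 parents recorded, a further link cannot be represented in this model.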

SLIDE 28

Outline

Introduction
Crash‐consistency and Object identity
The No‐Order File System

– Backpointer‐based consistency
– Non‐persistent allocation structures

Results
Conclusion


SLIDE 29

Evaluation

Q: Is NoFS robust against crashes?

– Fault injection testing

Q: What is the overhead of NoFS?

– Evaluated on micro and macro benchmarks

Q: How does the background scan affect performance?

– Measured write bandwidth, access latency during scan

SLIDE 30

Is NoFS robust against crashes?

Fault injection testing

Interpose pseudo‐device driver between the file system and disk
Discard writes to selected sectors
Emulate crash with different blocks successfully updated on disk

20 different crash scenarios

NoFS detected all inconsistencies

Errors returned on invalid access
Orphan inodes/blocks reclaimed

[Figure: writes from the file system pass through the pseudo‐device driver, and only selected writes reach the disk]
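The fault injection setup can be mimicked in user space as follows. This is a toy model under stated assumptions, not the pseudo‐device driver from the talk: `Disk` and `FaultInjector` are hypothetical classes, sectors are dictionary keys, and the sector numbers are made up.

```python
from typing import Dict, List, Set


class Disk:
    """A toy disk: sector number -> contents."""

    def __init__(self) -> None:
        self.sectors: Dict[int, bytes] = {}

    def write(self, sector: int, data: bytes) -> None:
        self.sectors[sector] = data

    def read(self, sector: int) -> bytes:
        return self.sectors.get(sector, b"")


class FaultInjector:
    """Sits between the file system and the disk and drops chosen writes."""

    def __init__(self, disk: Disk, drop_sectors: Set[int]) -> None:
        self.disk = disk
        self.drop_sectors = drop_sectors
        self.dropped: List[int] = []

    def write(self, sector: int, data: bytes) -> None:
        if sector in self.drop_sectors:
            self.dropped.append(sector)  # the write silently never lands
            return
        self.disk.write(sector, data)

    def read(self, sector: int) -> bytes:
        return self.disk.read(sector)


disk = Disk()
disk.write(1, b"inode A: points to block 5")  # state before the operation

# Emulate a crash in which the inode update (sector 1) is lost
# while the data block update (sector 5) reaches the disk.
device = FaultInjector(disk, drop_sectors={1})
device.write(1, b"inode A: truncated")
device.write(5, b"block 5: backpointer -> file B")

print(device.read(1))  # b'inode A: points to block 5' (stale)
print(device.read(5))  # b'block 5: backpointer -> file B'
print(device.dropped)  # [1]
```

Varying `drop_sectors` yields different combinations of blocks that did and did not reach the disk, which is how a set of distinct crash scenarios can be enumerated.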

SLIDE 31

What is the overhead of NoFS?

Performance in micro and macro benchmarks

[Figure: bar chart of throughput normalized to ext2 for ext2, NoFS, and ext3 on four workloads]

– SeqWrite: Writes to 1 GB file
– RandWrite: 4088 bytes per write to 1 GB file
– File Create: 100K files over 100 directories with fsync
– Varmail: Filebench

NoFS performance comparable to ext2
NoFS performance is better than ext3 for sync heavy workloads

SLIDE 32

How does the background scan affect performance?

Scan reads are interleaved with file system I/O
Access to objects not verified by scan incurs a performance penalty

SLIDE 33

Scan reads are interleaved with file system I/O

Scan reads interfere with application reads and writes

Experiment

– Write a 200 MB file every 30 seconds
– Measure bandwidth

SLIDE 34

Scan reads are interleaved with file system I/O

[Figure: Write bandwidth obtained. Bandwidth (MB/s) versus Time (s)]

SLIDE 35

Scan reads are interleaved with file system I/O

[Figure: Write bandwidth obtained. Bandwidth (MB/s) versus Time (s), with scan completion marked]

I/O bandwidth is reduced during scan, but peak performance achieved on scan completion

SLIDE 36

Access to objects not verified by scan costs more

The stat problem

– stat returns number of blocks allocated
– This information might be stale for un‐verified inode
– NoFS verifies the inode upon stat

Involves checking each inode data block
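The cost of stat on an un‐verified inode can be seen in a small model. This is a sketch with hypothetical names (`Volume`, `stat_blocks`, the `verified` flag) and it assumes that verifying an inode means reading the backpointer of every block the inode points to.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Inode:
    number: int
    block_pointers: List[int]
    blocks_allocated: int  # on-disk count, possibly stale after a crash
    verified: bool = False


class Volume:
    """Toy volume that counts how many data blocks had to be read."""

    def __init__(self, backpointers: Dict[int, Tuple[int, int]]) -> None:
        self.backpointers = backpointers
        self.block_reads = 0

    def read_backpointer(self, block_no: int) -> Optional[Tuple[int, int]]:
        self.block_reads += 1
        return self.backpointers.get(block_no)


def stat_blocks(inode: Inode, volume: Volume) -> int:
    """Return the block count, verifying the inode first if needed."""
    if not inode.verified:
        count = 0
        for offset, block_no in enumerate(inode.block_pointers):
            if volume.read_backpointer(block_no) == (inode.number, offset):
                count += 1
        inode.blocks_allocated = count
        inode.verified = True
    return inode.blocks_allocated


volume = Volume({3: (1, 0), 4: (1, 1), 5: (2, 0)})  # block 5 belongs to inode 2
inode = Inode(number=1, block_pointers=[3, 4, 5], blocks_allocated=3)

print(stat_blocks(inode, volume), volume.block_reads)  # 2 3
print(stat_blocks(inode, volume), volume.block_reads)  # 2 3
```

The first call pays one read per data block and corrects the stale count; the second call is answered from the verified inode with no further reads.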

SLIDE 37

Access to objects not verified by scan costs more

Experiment

– Create a number of directories with 128 files (each 1 MB)
– At each 50 second interval, starting from file‐system mount

Run ls -l on directory
This causes a stat call on every inode
stat on un‐verified inodes requires reading all its data

– Measure time taken

SLIDE 38

Access to objects not verified by scan costs more

[Figure: Time taken for ls -l (s) versus Time after file‐system mount (s), with scan completion marked]

There is a performance cost to accessing un‐verified objects during the scan
One time cost, only until scan completion

SLIDE 39

Outline

Introduction
Crash‐consistency and Object identity
The No‐Order File System

– Backpointer‐based consistency
– Non‐persistent allocation structures

Results
Conclusion


SLIDE 40

Summary

Problem: Providing crash‐consistency and high availability without ordering points

Solution: NoFS with Backpointer‐based consistency

– Use mutual agreement to drive consistency

Advantages:

– Strong consistency guarantees
– Performance similar to order‐less file system

SLIDE 41

Conclusion

Trust is implicit in many layers of storage systems
Removing such trust is key to building robust, reliable storage systems

SLIDE 42

Thank you!

Advanced Systems Lab (ADSL)
University of Wisconsin‐Madison
http://www.cs.wisc.edu/adsl

Questions?

SLIDE 43


Backup Slides

SLIDE 44

[Figure: Running time of scan. Time (s) versus Total data in the file system (MB)]

SLIDE 45

[Figure: Performance cost of stat on unverified inodes. Time for ls system call (s) versus Time (s), for total data of 128 MB, 256 MB, and 512 MB, with scan completion marked]

SLIDE 46

[Figure: Effect of background scan on write bandwidth. Write bandwidth (MB/s) versus Time (s), for writes starting at 0s and writes starting at 20s, with a background scan every 30 seconds]

SLIDE 47

[Figure: Performance of data block scan. Time taken (s) versus Total data scanned (MB)]

SLIDE 48


Lines of code: 6765 Kernel: 2869 File system: 3869

SLIDE 49

Use cases

NoFS provides crash‐consistency without ordering
BBC can be used in conventional file systems to ensure runtime integrity
NoFS can be used as local file system in GFS, HDFS
NoFS allows virtual machines to maintain consistency without trusting lower‐layer primitives