SLIDE 1

Consistency Without Ordering

Vijay Chidambaram, Tushar Sharma, Andrea Arpaci‐Dusseau, Remzi Arpaci‐Dusseau

The Advanced Systems Laboratory

University of Wisconsin Madison

SLIDE 2

The problem: crash consistency

  • Single operation updates multiple blocks
  • System might crash in the middle of the operation

– Some blocks updated, some blocks not updated

  • After a crash, the file system needs to be repaired

– In order to restore consistency among blocks

FAST 12, 2/15/12

SLIDE 3

Solution #1: Lazy, optimistic approach

  • Write blocks to disk in any order

– Fix inconsistencies upon reboot

  • Advantage: Simple, High performance
  • Disadvantage: Expensive recovery
  • Example: ext2 with fsck [Card94]

SLIDE 4

Solution #2: Eager, pessimistic approach

  • Carefully order writes to disk
  • Advantage: Quick recovery
  • Disadvantage: Perpetual performance penalty
  • Examples

– Soft updates (FFS) [Ganger94]
– Journaling (CFS) [Hagmann87]
– Copy-on-write (ZFS) [Bonwick04]

SLIDE 5

Ordering points considered harmful

  • Reduce performance

– Constrain scheduling of disk writes

  • Increase complexity
  • Require lower-level primitives

– IDE/SATA Cache flush commands

SLIDE 6

Ordering points require trust

  • File system runs on stack of virtual devices

– Consistency fails if any device ignores commands to flush cache

F_FULLFSYNC: "…The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data."

VirtualBox: "If desired, the virtual disk images can be flushed when the guest issues the IDE FLUSH CACHE command. Normally these requests are ignored for improved performance"

SLIDE 7

Is crash‐consistency possible without ordering points?

  • Middle ground between lazy and eager approaches
  • Simplicity and high performance of lazy approach
  • Strong consistency and availability of eager approach

SLIDE 8

Our solution: No-Order File System (NoFS)

Order-less file system which uses mutual agreement between objects to obtain consistency

SLIDE 9

Results

  • Designed a new crash‐consistency technique

– Backpointer‐based consistency (BBC)

  • Theoretically and experimentally verified that NoFS provides strong consistency
  • Evaluated NoFS against ext2 and ext3

– NoFS performance comparable to ext2
– NoFS performance equal to or better than ext3

SLIDE 10

Outline

  • Introduction
  • Crash-consistency and Object identity
  • The No‐Order File System
  • Results
  • Conclusion

SLIDE 11

Crash consistency and object identity

All file system inconsistencies are due to ambiguity about the logical identity of an object

  • Logical identity of an object

– Data block: Owner file, offset
– File: Parent directories

  • Common inconsistencies

– Two files claim the same data block
– File points to garbage data

SLIDE 12

Crash Scenario

  • Actions:

– File A is truncated
– The freed data block is allocated to File B
– The updated data blocks are written to disk

  • Problem: Due to a crash, File A is not updated on disk
  • Result: On disk, both files claim the data block

[Figure: File A, File B, and the data block, in memory and on disk]
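
The scenario above can be reproduced with a toy model. This is a minimal sketch, not NoFS code: a file is modelled as a plain list of block numbers, and the names `disk`, `memory`, `crash_after_partial_write`, `owners`, and the block number 7 are all hypothetical.

```python
def crash_after_partial_write():
    """Return the on-disk state after the crash described above."""
    # On-disk state before the operations: File A owns block 7.
    disk = {"A": [7], "B": []}
    # In-memory state after truncating A and giving block 7 to B.
    memory = {"A": [], "B": [7]}
    # Only File B's update reaches the disk before the crash;
    # File A's truncation is lost.
    disk["B"] = list(memory["B"])
    return disk


def owners(disk, block):
    """List every file whose on-disk pointers include `block`."""
    return sorted(name for name, blocks in disk.items() if block in blocks)
```

Calling `owners(crash_after_partial_write(), 7)` lists both files, which is the ambiguity about the block's logical identity that the talk describes.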

SLIDE 13

Outline

  • Introduction
  • Crash-consistency and Object identity
  • The No-Order File System

– Backpointer-based consistency (BBC)
– Non-persistent allocation structures

  • Results
  • Conclusion

SLIDE 14

Backpointer‐based consistency (BBC)

  • Associate object with its logical identity

– Embed backpointer into each object
– Owner(s) of the object found through backpointer

  • Consistency obtained through mutual agreement
  • Key assumption

– Object and backpointer written atomically

[Figure: File A and a data block]

SLIDE 15

Using backpointers in a crash scenario

  • Actions:

– File A is truncated
– The freed data block is allocated to File B
– The updated data blocks are written to disk

  • Problem: Due to a crash, File A is not updated on disk
  • Result: Using the backpointer, the true owner is identified

[Figure: File A, File B, and the data block, in memory and on disk]
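
The backpointer check can be illustrated with a small model. This is a sketch under stated assumptions, not the NoFS implementation: `Block`, `InconsistentPointer`, and `read_block` are hypothetical names, files are dictionaries from offset to block number, and the backpointer is assumed to have been written atomically with the block, as the key assumption of BBC requires.

```python
from dataclasses import dataclass


@dataclass
class Block:
    data: bytes
    owner: str   # backpointer: name of the owning file
    offset: int  # backpointer: position of the block within that file


class InconsistentPointer(Exception):
    """Raised when a forward pointer and a backpointer disagree."""


def read_block(files, blocks, name, offset):
    """Follow a file's forward pointer, then check the block's backpointer."""
    block = blocks[files[name][offset]]
    if (block.owner, block.offset) != (name, offset):
        raise InconsistentPointer(f"{name}[{offset}] does not own this block")
    return block.data


# On-disk state after the crash: both files still point at block 7,
# but the block was written together with a backpointer naming File B.
files = {"A": {0: 7}, "B": {0: 7}}
blocks = {7: Block(data=b"B's data", owner="B", offset=0)}
```

Reading through File B succeeds, while reading the same block through File A raises `InconsistentPointer`, so the stale pointer is detected at access time without any ordering of the writes.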

SLIDE 16

Backpointers of different objects

  • Data blocks have a single backpointer to file
  • Files can have many backpointers

– One for each parent directory

  • Detection of inconsistencies

– Each access of an object involves checking its backpointer

[Figure: a file, a data block, and two directories]

SLIDE 17

Formal Model of BBC

  • Extended a formal model for file systems with backpointers [Sivathanu05]
  • Defined the level of consistency provided by BBC

– Data consistency

  • Proved that a file system with backpointers provides data consistency

SLIDE 18

Outline

  • Introduction
  • Crash-consistency and Object identity
  • The No-Order File System

– Backpointer-based consistency
– Non-persistent allocation structures

  • Results
  • Conclusion

SLIDE 19

Allocation structures

  • File systems need to track allocation status
  • Crash may leave such structures inconsistent
  • True allocation status needs to be found

[Figure: File A, data block 1, and the data block bitmap, in memory and on disk]

SLIDE 20

Allocation structures

  • After a crash, true allocation status of all objects must be found
  • Traditional file systems do this proactively

– File-system check scans disk to get status
– Journaling file systems write to a log to avoid scan

SLIDE 21

Non-persistent allocation structures

  • NoFS does not persist allocation structures
  • Why?

– Cannot be trusted after crash, need to be verified
– Complicate update protocol

SLIDE 22

Non-persistent allocation structures

  • How is allocation information tracked then?

– Need to know which metadata/data blocks are free

  • Move the work of finding allocation information to the background

– Creation of new objects can proceed without complete allocation information

SLIDE 23

Non-persistent allocation structures

  • Backpointers used to determine allocation

– Object in use if pointers mutually agree
– Check each object individually
– Use validity bitmaps to track checked objects

  • Allocation structures built up incrementally
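
One way to picture the per-object check is the sketch below. It is illustrative only: `check_block` and its parameters are hypothetical, forward pointers and backpointers are held in dictionaries rather than read from disk, and both bitmaps are plain Python lists.

```python
def check_block(block_no, inodes, backpointers, in_use, checked):
    """Decide whether one block is allocated and record the answer.

    inodes:       inode number -> {offset: block number} (forward pointers)
    backpointers: block number -> (inode number, offset), or missing
    in_use:       allocation bitmap being rebuilt (list of bool)
    checked:      validity bitmap (list of bool), True once a block is verified
    """
    allocated = False
    backpointer = backpointers.get(block_no)
    if backpointer is not None:
        inode_no, offset = backpointer
        # Mutual agreement: the owner named by the backpointer must
        # point back at this block from the same offset.
        allocated = inodes.get(inode_no, {}).get(offset) == block_no
    in_use[block_no] = allocated
    checked[block_no] = True
    return allocated
```

Because each call touches a single block, the allocation bitmap can be filled in one entry at a time, and the validity bitmap records which entries can already be trusted.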

SLIDE 24

Determining allocation information

[Figure: ext2 versus NoFS. ext2 panel: data block bitmap and the data blocks of Files A-D. NoFS panel: data block bitmap, data block validity bitmap, and the data blocks of Files A-D.]

SLIDE 25

Background Scan

  • Complete allocation information not needed
  • Allocation information discovered using two background threads

– One for metadata
– One for data

  • Scheduling of scan can be configured

– Run when idle
– Run periodically
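
The idea of allocating with only partial information can be sketched as follows. This is a single-threaded stand-in for one scan thread, not the NoFS scanner: the class `IncrementalScan`, its `step` and `allocate` methods, and the choice to return `None` when no verified free block is known yet are all assumptions made for illustration.

```python
class IncrementalScan:
    """Single-threaded stand-in for one background scan thread."""

    def __init__(self, num_blocks, is_allocated):
        self.is_allocated = is_allocated      # callable: block number -> bool
        self.in_use = [False] * num_blocks    # allocation bitmap (in memory only)
        self.checked = [False] * num_blocks   # validity bitmap
        self.next_block = 0

    def step(self, batch=1):
        """Verify up to `batch` more blocks; return True once the scan is done."""
        end = min(self.next_block + batch, len(self.checked))
        for block_no in range(self.next_block, end):
            self.in_use[block_no] = self.is_allocated(block_no)
            self.checked[block_no] = True
        self.next_block = end
        return end == len(self.checked)

    def allocate(self):
        """Hand out a block already verified as free, or None if none is known."""
        for block_no, ok in enumerate(self.checked):
            if ok and not self.in_use[block_no]:
                self.in_use[block_no] = True
                return block_no
        return None
```

A caller would invoke `step` when the system is idle or on a timer, matching the two scheduling options on the slide, while `allocate` keeps working from whatever portion of the bitmap has been verified so far.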

SLIDE 26

Design

[Figure: memory and disk layout, showing file, data block, directory, inode bitmap, data block bitmap, group descriptor, inode validity bitmap, and data block validity bitmap]

SLIDE 27

Implementation

  • Based on ext2 codebase
  • Three types of backpointers

– Data block backpointers {inode num, offset}
– Inode backlinks {inode num}
– Directory block backpointers {dot directory entry}

  • Inode size increased to support 32 backlinks
  • Modified the Linux page cache to add checks
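
The first two backpointer types can be written down as small records. This is a sketch only: the field names follow the slide, but the classes `DataBlockBackpointer` and `Inode`, the method names, and the decision to raise `OSError` once all 32 slots are used are assumptions, and directory block backpointers (the dot directory entry) are left out.

```python
from dataclasses import dataclass, field

MAX_BACKLINKS = 32  # from the slide: inodes have room for 32 backlinks


@dataclass(frozen=True)
class DataBlockBackpointer:
    inode_num: int  # owning file
    offset: int     # block's position within the file


@dataclass
class Inode:
    inode_num: int
    backlinks: list = field(default_factory=list)  # parent directory inode numbers

    def add_backlink(self, parent_inode_num):
        """Record one more parent directory (for example, a new hard link)."""
        if len(self.backlinks) >= MAX_BACKLINKS:
            # Assumption: this sketch simply refuses the link.
            raise OSError("no room for another backlink")
        self.backlinks.append(parent_inode_num)

    def linked_from(self, parent_inode_num):
        """True if this inode agrees that the directory is one of its parents."""
        return parent_inode_num in self.backlinks
```

A directory entry would then be treated as valid only when `linked_from` returns true for the directory that contains it, which is the file-level form of mutual agreement.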

SLIDE 28

Outline

  • Introduction
  • Crash-consistency and Object identity
  • The No-Order File System

– Backpointer-based consistency
– Non-persistent allocation structures

  • Results
  • Conclusion

SLIDE 29

Evaluation

  • Q: Is NoFS robust against crashes?

– Fault injection testing

  • Q: What is the overhead of NoFS?

– Evaluated on micro and macro benchmarks

  • Q: How does the background scan affect performance?

– Measured write bandwidth, access latency during scan

SLIDE 30

Is NoFS robust against crashes?

[Figure: writes from the file system pass through a pseudo-device driver, and only selected writes reach the disk]

Fault injection testing

  • Interpose a pseudo-device driver between the file system and disk
  • Discard writes to selected sectors
  • Emulate crash with different blocks successfully updated on disk
  • 20 different crash scenarios

NoFS detected all inconsistencies

  • Errors returned on invalid access
  • Orphan inodes/blocks reclaimed
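
The interposition idea can be shown with a user-space toy. This is not the pseudo-device driver used in the evaluation: `FaultInjectingDevice` is a hypothetical class, the disk is a dictionary from sector number to bytes, and the sector numbers are made up.

```python
class FaultInjectingDevice:
    """Sits between a file system and a 'disk' (here a dict: sector -> bytes)."""

    def __init__(self, disk, drop_sectors):
        self.disk = disk
        self.drop_sectors = set(drop_sectors)
        self.dropped = []  # sectors whose writes were discarded

    def write(self, sector, data):
        if sector in self.drop_sectors:
            # Emulate a crash that hit before this write reached the disk.
            self.dropped.append(sector)
            return
        self.disk[sector] = data

    def read(self, sector):
        return self.disk.get(sector)
```

Choosing a different `drop_sectors` set for each run gives a different combination of blocks that made it to disk, which is how a set of distinct crash scenarios can be generated from the same workload.
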
SLIDE 31

What is the overhead of NoFS?

[Figure: Performance in micro and macro benchmarks. Normalized throughput vs ext2 for ext2, NoFS, and ext3.]

  • SeqWrite: writes to 1 GB file
  • RandWrite: 4088 bytes per write to 1 GB file
  • File Create: 100K files over 100 directories with fsync
  • Varmail: Filebench

  • NoFS performance comparable to ext2
  • NoFS performance is better than ext3 for sync-heavy workloads

SLIDE 32

How does the background scan affect performance?

  • Scan reads are interleaved with file system I/O
  • Access to objects not verified by scan incurs a performance penalty

SLIDE 33

Scan reads are interleaved with file system I/O

  • Scan reads interfere with application reads and writes
  • Experiment

– Write a 200 MB file every 30 seconds
– Measure bandwidth

SLIDE 34

Scan reads are interleaved with file system I/O

[Figure: Write bandwidth obtained. Bandwidth (MB/s) vs. time (s).]

SLIDE 35

Scan reads are interleaved with file system I/O

[Figure: Write bandwidth obtained, with scan completion marked. Bandwidth (MB/s) vs. time (s).]

I/O bandwidth is reduced during scan, but peak performance is achieved on scan completion

SLIDE 36

Access to objects not verified by scan costs more

  • The stat problem

– stat returns number of blocks allocated
– This information might be stale for an unverified inode
– NoFS verifies the inode upon stat

  • Involves checking each inode data block
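
A rough model of verify-on-stat is shown below. It is a sketch under assumptions, not the NoFS code path: `stat_blocks` is a hypothetical helper, inodes are dictionaries with `ptrs` and `nblocks` keys, and a dictionary lookup stands in for reading each data block to inspect its backpointer.

```python
def stat_blocks(inode_num, inodes, backpointers, verified):
    """Return the block count for an inode, verifying it first if needed.

    inodes:       inode number -> {"ptrs": {offset: block number}, "nblocks": int}
    backpointers: block number -> (inode number, offset)
    verified:     set of inode numbers already checked
    """
    inode = inodes[inode_num]
    if inode_num not in verified:
        # Count only blocks whose backpointer agrees with the forward pointer.
        inode["nblocks"] = sum(
            1
            for offset, block_no in inode["ptrs"].items()
            if backpointers.get(block_no) == (inode_num, offset)
        )
        verified.add(inode_num)
    return inode["nblocks"]
```

The first call on an unverified inode walks every block pointer, which is where the extra cost comes from; later calls return the stored count directly.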

SLIDE 37

Access to objects not verified by scan costs more

  • Experiment

– Create a number of directories with 128 files (each 1 MB)
– At each 50 second interval, starting from file-system mount

  • Run ls -l on directory
  • This causes a stat call on every inode
  • stat on unverified inodes requires reading all its data

– Measure time taken

SLIDE 38

Access to objects not verified by scan costs more

[Figure: Time taken for ls -l (s) vs. time after file-system mount (s), with scan completion marked.]

  • There is a performance cost to accessing unverified objects during the scan
  • One-time cost, only until scan completion

SLIDE 39

Outline

  • Introduction
  • Crash-consistency and Object identity
  • The No-Order File System

– Backpointer-based consistency
– Non-persistent allocation structures

  • Results
  • Conclusion

SLIDE 40

Summary

  • Problem: Providing crash-consistency and high availability without ordering points
  • Solution: NoFS with Backpointer-based consistency

– Use mutual agreement to drive consistency

  • Advantages:

– Strong consistency guarantees
– Performance similar to order-less file system

SLIDE 41

Conclusion

  • Trust is implicit in many layers of storage systems
  • Removing such trust is key to building robust, reliable storage systems

SLIDE 42

Thank you!

Questions?

Advanced Systems Lab (ADSL)
University of Wisconsin-Madison
http://www.cs.wisc.edu/adsl

SLIDE 43

Backup Slides

SLIDE 44

[Figure: Running time of scan. Time (s) vs. total data in the file system (MB).]

SLIDE 45

[Figure: Performance cost of stat on unverified inodes. Time for ls system call (s) vs. time (s), for total data of 128 MB, 256 MB, and 512 MB, with scan completion marked.]

SLIDE 46

[Figure: Effect of background scan on write bandwidth. Write bandwidth (MB/s) vs. time (s), for writes starting at 0 s and writes starting at 20 s, with a background scan every 30 seconds.]

SLIDE 47

[Figure: Performance of data block scan. Time taken (s) vs. total data scanned (MB).]

SLIDE 48

Lines of code: 6765

– Kernel: 2869
– File system: 3869

SLIDE 49

Use cases

  • NoFS provides crash-consistency without ordering
  • BBC can be used in conventional file systems to ensure runtime integrity
  • NoFS can be used as local file system in GFS, HDFS
  • NoFS allows virtual machines to maintain consistency without trusting lower-layer primitives