CephFS fsck: Distributed Filesystem Checking
Greg Farnum, CephFS Tech Lead, Red Hat


SLIDE 1

SLIDE 2

CephFS fsck: Distributed Filesystem Checking

SLIDE 3

Hi, I’m Greg

Greg Farnum
CephFS Tech Lead, Red Hat
gfarnum@redhat.com
Been working as a core Ceph developer since June 2009

SLIDE 4

SLIDE 5

What is Ceph?

An awesome, software-based, scalable, distributed storage system that is designed for failures

  • Object storage (our native API)
  • Block devices (Linux kernel, QEMU/KVM, others)
  • RESTful S3 & Swift API object store
  • POSIX Filesystem
SLIDE 6

[Diagram: the Ceph stack; apps, hosts/VMs, and clients sit on top of these components]

RADOS (Reliable Autonomic Distributed Object Store): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift

SLIDE 7

What is CephFS?

An awesome, software-based, scalable, distributed POSIX-compliant file system that is designed for failures

SLIDE 8

RADOS

A user perspective

SLIDE 9

Objects in RADOS

[Diagram: a RADOS object holds byte data, xattrs, a version (e.g. version: 1), and a key/value map (e.g. foo -> bar, baz -> qux)]

SLIDE 10

The librados API

C, C++, Python, Java, shell. File-like API (a Python sketch follows the list):

  • read/write (extent), truncate, remove; get/set/remove xattr or key
  • efficient copy-on-write clone
  • Snapshots: single object or pool-wide
  • atomic compound operations/transactions:
  • read + getxattr, write + setxattr
  • compare xattr value, if match write + setxattr
  • “object classes”:
  • load new code into the cluster to implement new methods
  • calc sha1, grep/filter, generate thumbnail
  • encrypt, increment, rotate image
  • implement your own access mechanisms (HDF5 on the node)
  • watch/notify: use an object as a communication channel between clients (locking primitive)
  • pgls: list the objects within a placement group
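
As a concrete illustration, here is a minimal sketch of driving this API from the Python bindings (python-rados); the pool name 'data' is a placeholder, and it assumes a reachable cluster and keyring:

    import rados

    # Connect using the local Ceph config (assumes a running cluster).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    ioctx = cluster.open_ioctx('data')              # 'data' is a hypothetical pool
    try:
        ioctx.write_full('greeting', b'hello')      # create/overwrite an object
        ioctx.set_xattr('greeting', 'lang', b'en')  # attach an xattr
        print(ioctx.read('greeting'))               # read the data back
        ioctx.remove_object('greeting')             # remove it again
    finally:
        ioctx.close()
        cluster.shutdown()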

SLIDE 11

The RADOS Cluster

[Diagram: a CLIENT talks to a cluster of Object Storage Devices (OSDs) and three monitors (M)]

SLIDE 12

[Diagram: objects are grouped into pools (Pool 1, Pool 2) and stored across the Object Storage Devices (OSDs); a Monitor (M) tracks cluster state]

SLIDE 13

[Diagram: object placement happens in two steps]

PG = hash(object name) % num_pgs
OSDs = CRUSH(PG, cluster state, rule set)
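
A toy rendering of those two steps (Ceph's real object hash and CRUSH itself are far more involved; both functions below are stand-ins):

    import zlib

    def object_to_pg(object_name: str, pg_num: int) -> int:
        # Step 1: a stable hash of the object name, folded into the pool's PG count.
        # (zlib.crc32 stands in for Ceph's real string hash.)
        return zlib.crc32(object_name.encode()) % pg_num

    def pg_to_osds(pg: int, osds: list[int], replicas: int = 3) -> list[int]:
        # Step 2: CRUSH maps a PG plus the cluster map and rule set to OSDs.
        # This stand-in just picks deterministic, distinct OSDs for the PG.
        return [osds[(pg + i) % len(osds)] for i in range(replicas)]

    print(pg_to_osds(object_to_pg('10000000001.00000000', 64), osds=list(range(10))))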

SLIDE 14

RADOS data guarantees

  • Any write acked as safe will be visible to all subsequent readers
  • Any write ever visible to a reader will be visible to all subsequent readers
  • Any write acked as safe will not be lost unless the whole containing PG is lost
  • A PG will not be lost unless all N copies are lost (N is admin-configured, usually 3)…
  • …and in case of OSD failure the system will try to bring you back up to N copies (no user intervention required)

SLIDE 15

RADOS data guarantees

  • Data is regularly scrubbed to ensure copies are consistent with each other, and administrators are alerted if inconsistencies arise
  • …and while it’s not automated, it’s usually easy to identify the correct data with “majority voting” or similar
  • btrfs maintains checksums for certainty, and we think this is the future

SLIDE 16

CephFS

System Design

SLIDE 17

CephFS Design Goals

  • Infinitely scalable
  • Avoid all Single Points Of Failure
  • Self-managing

SLIDE 18

[Diagram: the CLIENT talks to a Metadata Server (MDS) for metadata, and to the OSDs/monitors for data]

SLIDE 19

Scaling Metadata

So we have to use multiple MetaData Servers (MDSes). Two issues:

  • Storage of the metadata
  • Ownership of the metadata

SLIDE 20

Scaling Metadata – Storage

Some systems store metadata on the MDS system itself. But that’s a Single Point Of Failure!

  • Hot standby?
  • External metadata storage √

SLIDE 21

Scaling Metadata – Ownership

Traditionally: assign hierarchies manually to each MDS

  • But if workloads change, your nodes can become unbalanced

Newer: hash directories onto MDSes

  • But then clients have to jump around for every folder traversal

SLIDE 22

[Diagram: one tree, two metadata servers]

SLIDE 23

[Diagram: one tree, two metadata servers]

SLIDE 24

The Ceph Metadata Server

Key insight: if metadata is stored in RADOS, ownership should be impermanent. One MDS is authoritative over any given subtree, but...

  • That MDS doesn’t need to keep the whole tree in-memory
  • There’s no reason the authoritative MDS can’t be changed!

SLIDE 25

The Ceph MDS – Partitioning

Cooperative Partitioning between servers:

  • Keep track of how hot metadata is
  • Migrate subtrees to keep heat distribution similar
  • Cheap because all metadata is in RADOS
  • Maintains locality

SLIDE 26

The Ceph MDS – Persistence

All metadata is written to RADOS

  • And changes are only visible once they’re in RADOS

SLIDE 27

The Ceph MDS – Clustering Benefits

Dynamic adjustment to metadata workloads

  • Replicate hot data to distribute workload

Dynamic cluster sizing:

  • Add nodes as you wish
  • Decommission old nodes at any time

Recover quickly and easily from failures

SLIDE 28

SLIDE 29

SLIDE 30

SLIDE 31

SLIDE 32

DYNAMIC SUBTREE PARTITIONING

SLIDE 33

Does it work?

SLIDE 34

It scales!

SLIDE 35

It redistributes!

SLIDE 36

Cool Extras

Besides POSIX-compliance and scaling

SLIDE 37

Snapshots

    $ mkdir foo/.snap/one      # create snapshot
    $ ls foo/.snap
    one
    $ ls foo/bar/.snap
    _one_1099511627776         # parent's snap name is mangled
    $ rm foo/myfile
    $ ls -F foo
    bar/
    $ ls foo/.snap/one
    myfile bar/
    $ rmdir foo/.snap/one      # remove snapshot

SLIDE 38

Recursive statistics

    $ ls -alSh | head
    total 0
    drwxr-xr-x 1 root root 9.7T 2011-02-04 15:51 .
    drwxr-xr-x 1 root root 9.7T 2010-12-16 15:06 ..
    drwxr-xr-x 1 pomceph pg4194980 9.6T 2011-02-24 08:25 pomceph
    drwx--x--- 1 fuzyceph adm 1.5G 2011-01-18 10:46 fuzyceph
    drwxr-xr-x 1 dallasceph pg275 596M 2011-01-14 10:06 dallasceph
    $ getfattr -d -m ceph. pomceph
    # file: pomceph
    ceph.dir.entries="39"
    ceph.dir.files="37"
    ceph.dir.rbytes="10550153946827"
    ceph.dir.rctime="1298565125.590930000"
    ceph.dir.rentries="2454401"
    ceph.dir.rfiles="1585288"
    ceph.dir.rsubdirs="869113"
    ceph.dir.subdirs="2"

SLIDE 39

Different Storage strategies

  • Set a “virtual xattr” on a directory and all new files underneath it follow that layout (a sketch follows this list)
  • Layouts can specify lots of detail about storage:
  • the pool file data goes into
  • how large file objects and stripes are
  • how many objects are in a stripe set
  • So in one cluster you can use:
  • one slow pool with big objects for Hadoop workloads
  • one fast pool with little objects for a scratch space
  • one slow pool with small objects for home directories
  • or whatever else makes sense...
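
One plausible way to set such a layout from Python via the virtual-xattr interface; the directory and pool names are hypothetical, and the exact ceph.dir.layout.* keys should be checked against your Ceph version:

    import os

    # Pin all new files under ./hadoop-data to a hypothetical 'slow-big' pool,
    # with large objects, via CephFS's ceph.dir.layout virtual xattrs.
    os.setxattr('hadoop-data', 'ceph.dir.layout.pool', b'slow-big')
    os.setxattr('hadoop-data', 'ceph.dir.layout.object_size', b'67108864')  # 64 MiB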

SLIDE 40

CephFS

Important Data structures

SLIDE 41

Directory objects

  • One (or more!) per directory
  • Deterministically named: <inode number>.<directory piece>
  • Embeds dentries and inodes for each child of the folder
  • Contains a potentially-stale versioned backtrace (path location)
  • Located in the metadata pool

SLIDE 42

File objects

  • One or more per file
  • Deterministically named <ino number>.<object number> (see the sketch below)
  • The first object contains a potentially-stale versioned backtrace
  • Located in any of the data pools
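
A hypothetical rendering of those deterministic names; CephFS formats them in hex, but the exact field widths here are an assumption:

    def file_object_name(ino: int, obj_index: int) -> str:
        # Nth data object of a file, e.g. "10000000001.00000003".
        return f"{ino:x}.{obj_index:08x}"

    def dir_object_name(ino: int, frag: int) -> str:
        # A directory fragment's object in the metadata pool.
        return f"{ino:x}.{frag:08x}"

    print(file_object_name(0x10000000001, 3))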

SLIDE 43

MDS log (objects)

The MDS fully journals all metadata operations. The log is chunked across objects.

  • Deterministically named <log inode number>.<log piece>
  • Log objects may or may not be replayable if previous entries are lost
  • each entry contains what it needs, but e.g. a file move can depend on a previous rename entry
  • Located in the metadata pool

SLIDE 44

MDSTable objects

  • Single objects
  • SessionMap (per-MDS)
  • stores the state of each client Session
  • particularly: preallocated inodes for each client
  • InoTable (per-MDS)
  • tracks which inodes are available to allocate
  • (this is not a traditional inode-mapping table or similar)
  • SnapTable (shared)
  • tracks system snapshot IDs and their state (in use, pending create/delete)
  • All located in the metadata pool

SLIDE 45

CephFS

Metadata update flow

SLIDE 46

Client Sends Request

[Diagram: the CLIENT sends a “create dir” request to the MDS; the OSDs hold log objects (log.1-log.3) and directory objects (dir.1-dir.3)]

SLIDE 47

MDS Processes Request

“Early Reply” and journaling

[Diagram: the MDS sends the CLIENT an early (unsafe) reply while writing a journal entry to the OSDs]

SLIDE 48

MDS Processes Request

Journaling and safe reply

[Diagram: once the OSDs ack the journal write (now log.4), the MDS sends the CLIENT a safe reply]

SLIDE 49

…time passes…

[Diagram: the journal has grown to log.4-log.8; the directory objects are still at their old versions]

SLIDE 50

MDS Flushes Log

[Diagram: the MDS writes the updated directory contents back to the directory objects on the OSDs]

SLIDE 51

MDS Flushes Log

[Diagram: the OSDs ack the directory write (now dir.4)]

SLIDE 52

MDS Flushes Log

[Diagram: the MDS deletes the log objects that the flushed directory state now covers]
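
Putting the whole flow together, a minimal sketch (all helper names are hypothetical; the real logic lives in the C++ MDS):

    # Sketch of the metadata-update flow shown above.
    def handle_client_request(mds, request):
        mds.apply_in_memory(request)           # update the in-RAM metadata
        mds.send_early_reply(request.client)   # unsafe reply: visible, not yet durable
        entry = mds.journal.append(request)    # journal write to a log.N object in RADOS
        entry.wait_for_ack()                   # RADOS acks the write as safe
        mds.send_safe_reply(request.client)    # the update now survives MDS failure

    def flush_log(mds):
        # Later, the journal is flushed: directory objects get their final
        # contents written back, and the covered log objects are deleted.
        for segment in mds.journal.expired_segments():
            mds.write_back_directories(segment)
            mds.journal.trim(segment)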

SLIDE 53

Traditional fsck

SLIDE 54

e2fsck

Drawn from “ffsck: The Fast File System Checker” (FAST ’13) and http://mm.iit.uni-miskolc.hu/Data/texts/Linux/SAG/node92.html

SLIDE 55

Key data structures & checks

  • Superblock
  • check free block count, free inode count
  • Data bitmap: blocks marked free are not in use
  • Inode bitmap: inodes marked free are not in use
  • Directories
  • inodes are allocated, reasonable, and in-tree
  • inodes
  • consistent internal state
  • link counts
  • blocks claimed are valid and unique

SLIDE 56

Procedure

  • Pass 1: iterate over all inodes
  • check self-consistency
  • build up maps of in-use blocks/inodes/etc.
  • correct any issues with doubly-allocated blocks
  • Pass 2: iterate over all directories
  • check dentry validity and that all referenced inodes are valid
  • cache a tree structure
  • Pass 3: check directory connectivity in-memory
  • Pass 4: check inode reference counts in-memory
  • Pass 5: check cached maps against on-disk maps and overwrite if needed

(A compressed sketch of these passes follows.)
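
A compressed sketch of those passes (hypothetical helpers, not real e2fsck code):

    def fsck(fs):
        used_blocks, link_counts = {}, {}

        for inode in fs.all_inodes():               # Pass 1: inode self-consistency
            inode.check_internal_state()
            for block in inode.blocks():
                if block in used_blocks:
                    fs.fix_doubly_allocated(block)  # resolve the duplicate claim
                used_blocks[block] = inode

        for directory in fs.all_directories():      # Pass 2: dentry validity
            for dentry in directory.entries():
                if not fs.inode_is_valid(dentry.ino):
                    fs.remove_dentry(dentry)
                link_counts[dentry.ino] = link_counts.get(dentry.ino, 0) + 1

        fs.check_connectivity()                     # Pass 3: every directory reachable
        fs.compare_link_counts(link_counts)         # Pass 4: inode reference counts
        fs.rewrite_bitmaps(used_blocks)             # Pass 5: cached maps vs. on-disk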

SLIDE 57

CephFS fsck

What it needs to do

SLIDE 58

RADOS is different from a disk

  • We look at objects, not disk blocks
  • And we can’t lose them (at least not in a way we can recover)
  • We can deterministically identify all pieces of file data
  • and the inode they belong to!
  • It is not feasible to keep all metadata in-memory at once
  • Data loss is the result of:
  • bugs in the system,
  • simultaneous catastrophic failure of RADOS (probably losing lots of random data),
  • or bitrot

SLIDE 59

Failure detection: “forward scrub”

  • Intended to find tree inconsistencies from bugs or bitrot
  • Runs continuously in the background
  • Traverse the tree and make sure referents agree in both directions

SLIDE 60

Catastrophic failure repair: backwards repair

  • Repair the tree after catastrophic failure, or after scrub detects an issue
  • Run only with administrator intervention
  • Could affordably be offline-only

SLIDE 61

Forward Scrub

Being implemented

SLIDE 62

Goal: Check the tree

In the background, examine the entire filesystem hierarchy and make sure it’s self-consistent.

SLIDE 63

File objects

  • Every inode we find has the correct data objects
  • The inode location and data object backtrace is self-consistent
  • slightly tricky, since they can be stale
  • (Optionally) The file size is consistent with what objects exist

SLIDE 64

Directory Objects

  • The directory objects are self-consistent with their summaries and actual contents (the rstats)
  • The directory’s backtrace and actual parent directory are self-consistent
  • This covers any bugs causing double-links

SLIDE 65

Constraints

  • Limited memory: we can’t hold all the metadata in-memory at once
  • Scalable: we have an MDS cluster and this needs to work within that framework

SLIDE 66

On-disk data structure changes

  • inode_t gets a scrub_stamp (time) and scrub_version to keep track of the last time it was scrubbed
  • frag_t gets a scrub_stamp and scrub_version for both “local” and “recursive” scrubs (a sketch follows this list)
  • Directories can be “fragmented” into multiple pieces; frag_t holds the metadata about a given directory fragment
  • These values can be reported to the user via our rstats “virtual xattr” interface

SLIDE 67

Algorithm Design

  • Obviously we’re doing a depth-first search:
  • this means in the worst case we restrict memory usage to O(tree depth)
  • validating files first means that when we validate directory rstats, they’re not about to get changed
  • Basic strategy: construct a stack of “CDentry”s to scrub, and when you scrub a directory, push its contents onto the top of the stack first
  • But we want to limit it so we don’t explode memory usage

SLIDE 68

ScrubStack

  • Examine the dentry (a sketch of this loop follows the list)
  • If it’s a file: scrub it directly and record scrub stamp/version
  • If it’s a directory:
  • on first access, generate a list of all dentries, segregated by directory/file status
  • this lets us kick out the CDentry, CInode, and CDir objects we had to read in to get the list
  • on first access, note the current version and time
  • push the next unscrubbed directory fragment onto the stack and restart
  • if there are no frags left, push files onto the stack and restart
  • when all files are scrubbed, scrub directory frags and record the directory scrubbed as of the start stamp/version values
  • If we hit a tree owned by another MDS, spin off a request to scrub that tree and move on to another area of our own hierarchy until it’s completed
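
A minimal sketch of that loop, where stack items are file dentries or directory fragments (helper names are hypothetical; the real ScrubStack operates on C++ CDentry/CDir objects):

    def scrub(stack):
        while stack:
            item = stack[-1]
            if item.is_file():
                validate_file(item)               # backtrace, size, data objects
                item.record_scrub()
                stack.pop()
            elif not item.is_auth():
                request_remote_scrub(item)        # another MDS owns this subtree
                stack.pop()                       # continue elsewhere meanwhile
            elif item.first_visit():
                item.note_start_version_and_stamp()
                item.list_children()              # split dirs/files, then drop the
                                                  # objects read in to build the list
            elif (frag := item.next_unscrubbed_frag()) is not None:
                stack.append(frag)                # depth-first: subdirectories first
            elif (f := item.next_unscrubbed_file()) is not None:
                stack.append(f)
            else:
                validate_dir(item)                # rstats vs. actual contents
                item.record_scrub_from_start()    # stamp/version noted at the start
                stack.pop()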

SLIDE 69

How’s it scale?

Well, each MDS scrubs the data for which it’s authoritative. There’s only interaction at the boundaries, and there’s a lot of machinery around that which makes it fairly easy.

SLIDE 70

Backwards scrub/repair

Being designed

SLIDE 71

Goal: repair the tree

Once we know that there’s been some failure (either due to catastrophic data loss or a serious bug), examine the raw RADOS state and get all the data back into the filesystem hierarchy. Make all referents consistent with each other.

SLIDE 72

File objects

  • Given an object name and location, we know if it’s a file object and which inode it belongs to
  • The first file object for an inode has a backtrace of its location

SLIDE 73

Directory Objects

  • Given an object name and location, we know if it’s a directory object, and for which directory inode
  • The directory has a backtrace on it
  • The directory has (versioned) forward links to all children

SLIDE 74

MDS Logs

The MDS logs can contain much newer versions of inodes than any of the backing objects

  • e.g., a file got renamed into a directory on a different MDS and it hasn’t been flushed yet

SLIDE 75

Constraints

  • Limited memory: we can’t hold all the metadata in-memory at once
  • Scalable: we have an MDS cluster and this needs to work within that framework

SLIDE 76

Algorithm Design

Well, we haven’t finished this yet…but we have lots of ideas!

SLIDE 77

Building blocks

  • Forward scrub: lets us identify things which we know are in the filesystem (potentially: tag them while doing so), and find things we know are missing but shouldn’t be
  • Backtraces: give us (possibly stale) snapshots of where in the tree each directory or file lives (see the sketch after this list)
  • importantly, these are versioned! So we can use the backtrace from one file to update the older backtrace from another
  • RADOS object listing: we can list every object in the filesystem
  • even better: we can inject “filters” to restrict the listing, to the first file object, or to objects which we have not tagged in a previous pass
  • MDS logs: each log segment contains a lot of info about the objects it changes
  • RADOS operations:
  • snapshots could be useful
  • you can store objects “next to” other objects: create a scratch space!
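
The backtrace shape those bullets rely on, sketched as a (hypothetical) Python structure:

    from dataclasses import dataclass

    @dataclass
    class Ancestor:
        dirino: int  # inode of an ancestor directory
        dname: str   # the name this entry has within that directory

    @dataclass
    class Backtrace:
        ino: int                   # inode the backtrace belongs to
        ancestors: list[Ancestor]  # ordered from immediate parent up to the root
        version: int               # newer backtraces supersede stale ones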

SLIDE 78

Check what we know we have

  • Flush the MDS logs out to disk
  • We can probably identify if logs are whole or not: they’re numbered consecutively, and we can check objects on the boundary against any lost PGs
  • Missing transactions are likely to be reconstructable from context: if the path for an inode changes unexpectedly, we can adjust to that!
  • Cross-MDS transactions are either renames across authoritative zones (can be resolved by looking at the journals together) or about transfers of authority (we can make up new boundaries)
  • Have each MDS scrub its auth data, and tag all reached objects
  • If we expect to find objects and don’t, or they are broken, add them to a fix list

SLIDE 79

Look at the raw objects

  • Run a (filtered?) listing of all objects in the metadata and data pools
  • we can filter out stuff that was tagged if we want; and depending on thoroughness we can skip objects without a backtrace
  • this can scale in several ways (e.g. “map” along PG lines, “reduce” based on guessed authority from backtraces)
  • For each found object we don’t have in the hierarchy, attempt to place it in the tree based on backtraces (a sketch follows this list)
  • is it so new that the dentry never got flushed out (or was lost in a busted journal)?
  • does it belong to a missing directory object?
  • Create “phantom” directories based on backtrace contents, if needed
  • If prior guessed data conflicts with new data, take the one with the newer version
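
A sketch of that placement step (hypothetical helpers; the actual design was still open when this talk was given):

    def backward_scan(pool, tree):
        # List raw objects, skipping those the forward scrub already tagged.
        for obj in pool.list_objects(filter='untagged'):
            ino, piece = parse_object_name(obj.name)  # "<hex ino>.<piece>"
            backtrace = obj.get_backtrace()           # possibly stale, but versioned
            if backtrace is None:
                continue                              # skippable, depending on thoroughness
            if not tree.contains(ino):
                parent = tree.lookup_or_create_phantom(backtrace.ancestors)
                parent.link(ino, backtrace)           # re-attach using the backtrace
            elif tree.backtrace_version(ino) < backtrace.version:
                tree.move(ino, backtrace)             # newer version wins conflicts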

SLIDE 80

Try again!

  • After doing both a hierarchy scan and a deep scan, and fixing things up, we should be able to touch everything in the system. Run it again to make sure.

SLIDE 81

How do we store repaired data?

We have a few options here:

  • use RADOS snapshots to not change the original data at all, and write updates in place
  • use the “locator” functionality to explicitly create our own in-flight repair objects next to the original data, until we can check them and flush back to the original objects
  • or something more complicated? (object class code, explicitly copying all dentries within a directory object, etc.)

SLIDE 82

How would repair scale?

  • Scrubbing scales across MDS authority zones as before
  • note that if necessary, we can spin up new MDSes with new authority zones to scale out the checking
  • Each MDS can try to repair data for which it is authoritative, and pass along objects it finds to belong elsewhere
  • and request stuff it thinks it should own from its peers, too
  • Obviously a mapping from the raw RADOS listing to these authoritative zones is required:
  • partition the PG space up evenly between MDSes, and let each one handle the listing
  • chop up the raw data into expected authority zones based on backtraces (probably log them to an area in RADOS)
  • ship off lists to the proper authoritative MDS

SLIDE 83

Questions?

  • #ceph-devel on irc.oftc.net
  • I’m gregsfortytwo
  • ceph-devel@vger.kernel.org
  • Greg Farnum
  • gfarnum@redhat.com
  • @gregsfortytwo
