CephFS fsck: Distributed Filesystem Checking
Hi, I’m Greg
Greg Farnum CephFS Tech Lead, Red Hat gfarnum@redhat.com Been working as a core Ceph developer since June 2009
3
4
5
What is Ceph?
An awesome, software-based, scalable, distributed storage system that is designed for failures
- Object storage (our native API)
- Block devices (Linux kernel, QEMU/KVM, others)
- RESTful S3 & Swift API object store
- POSIX Filesystem
6
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully- distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
Reliable Autonomic Distributed Object Store
7
What is CephFS?
An awesome, software-based, scalable, distributed POSIX- compliant file system that is designed for failures
RADOS
A user perspective
8
Objects in RADOS
9
Data: 01110010101 01010101010 00010101101
xattrs: version: 1
omap: foo -> bar, baz -> qux
The librados API
C, C++, Python, Java, shell. File-like API:
- read/write (extent), truncate, remove; get/set/remove xattr or key
- efficient copy-on-write clone
- Snapshots — single object or pool-wide
- atomic compound operations/transactions
- read + getxattr, write + setxattr
- compare xattr value, if match write + setxattr
- “object classes”
- load new code into cluster to implement new methods
- calc sha1, grep/filter, generate thumbnail
- encrypt, increment, rotate image
- Implement your own access mechanisms — HDF5 on the node
- watch/notify: use object as a communication channel between clients (a locking primitive)
- pgls: list the objects within a placement group
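The API above can be sketched with a tiny in-memory model. This is illustration only: the real Python binding is the `rados` module, and every class and method name below is invented to mirror the semantics on this slide, especially the atomic compound operation with an xattr guard.

```python
# Toy in-memory model of the librados object API described above.
# Invented names; the real binding is the `rados` module.

class FakeObject:
    def __init__(self):
        self.data = bytearray()
        self.xattrs = {}
        self.omap = {}

class FakeRados:
    """Stand-in for a RADOS pool: named objects with data, xattrs, omap."""
    def __init__(self):
        self.objects = {}

    def _obj(self, name):
        return self.objects.setdefault(name, FakeObject())

    def write(self, name, data, offset=0):
        obj = self._obj(name)
        end = offset + len(data)
        if len(obj.data) < end:
            obj.data.extend(b"\0" * (end - len(obj.data)))
        obj.data[offset:end] = data

    def read(self, name, length, offset=0):
        return bytes(self._obj(name).data[offset:offset + length])

    def compound(self, name, ops, require_xattr=None):
        """Apply a list of (op, *args) together, like a RADOS transaction.
        Optionally guard on an xattr value first (compare-then-write)."""
        obj = self._obj(name)
        if require_xattr is not None:
            key, expected = require_xattr
            if obj.xattrs.get(key) != expected:
                raise ValueError("xattr guard failed")
        for op, *args in ops:
            if op == "write":
                self.write(name, *args)
            elif op == "setxattr":
                obj.xattrs[args[0]] = args[1]

cluster = FakeRados()
# write + setxattr as one compound operation
cluster.compound("foo", [("write", b"hello"), ("setxattr", "version", "1")])
```

The guard models the "compare xattr value, if match write + setxattr" bullet: a mismatched expected value makes the whole compound fail.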
10
The RADOS Cluster
11
Object Storage Devices (OSDs), Monitors (M), and a client; objects are grouped into pools (Pool 1, Pool 2)
12
Object Storage Devices (OSDs) and a Monitor (M)
13
hash(object name) % num_pgs -> PG
CRUSH(pg, cluster state, rule set) -> OSDs
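The two-step placement above can be sketched in a few lines. The `toy_crush` function below is a deterministic stand-in, not the real CRUSH algorithm (which walks a weighted cluster hierarchy); the PG count and OSD ids are made up.

```python
# Sketch of RADOS placement: object name hashes to a placement group (PG),
# then a CRUSH-like function maps the PG to a set of OSDs.
import hashlib

NUM_PGS = 64  # illustrative; real pools configure this

def object_to_pg(name: str) -> int:
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % NUM_PGS

def toy_crush(pg: int, osds: list, replicas: int = 3) -> list:
    # Real CRUSH uses cluster state and rule sets; here we just pick
    # `replicas` distinct OSDs deterministically from the PG id.
    chosen, i = [], 0
    while len(chosen) < replicas and len(chosen) < len(osds):
        h = int(hashlib.md5(f"{pg}.{i}".encode()).hexdigest(), 16)
        osd = osds[h % len(osds)]
        if osd not in chosen:
            chosen.append(osd)
        i += 1
    return chosen

pg = object_to_pg("10000000000.00000000")
acting = toy_crush(pg, osds=list(range(10)))
```

The key property, shared with the real algorithm: any client can compute the same placement from the object name and cluster state alone, with no lookup table.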
RADOS data guarantees
- Any write acked as safe will be visible to all subsequent readers
- Any write ever visible to a reader will be visible to all subsequent readers
- Any write acked as safe will not be lost unless the whole containing PG is lost
- A PG will not be lost unless all N copies are lost (N is admin-configured, usually 3)…
- …and in case of OSD failure the system will try to bring you back up to N copies (no user intervention required)
14
RADOS data guarantees
- Data is regularly scrubbed to ensure copies are consistent with each other, and administrators are alerted if inconsistencies arise
- …and while it’s not automated, it’s usually easy to identify the correct data with “majority voting” or similar.
- btrfs maintains checksums for certainty and we think this is the future
15
CephFS
System Design
16
CephFS Design Goals
Infinitely scalable
Avoid all Single Points Of Failure
Self-managing
17
18
Metadata Server (MDS)
Scaling Metadata
So we have to use multiple metadata servers (MDSes). Two issues:
- Storage of the metadata
- Ownership of the metadata
19
Scaling Metadata – Storage
Some systems store metadata on the MDS node itself. But that’s a Single Point Of Failure!
- Hot standby?
- External metadata storage √
20
Scaling Metadata – Ownership
Traditionally: assign hierarchies manually to each MDS
- But if workloads change, your nodes can unbalance
Newer: hash directories onto MDSes
- But then clients have to jump around for every folder traversal
21
22
one tree
two metadata servers
23
one tree
two metadata servers
The Ceph Metadata Server
Key insight: if metadata is stored in RADOS, ownership should be impermanent. One MDS is authoritative over any given subtree, but...
- That MDS doesn’t need to keep the whole tree in-memory
- There’s no reason the authoritative MDS can’t be changed!
24
The Ceph MDS – Partitioning
Cooperative Partitioning between servers:
- Keep track of how hot metadata is
- Migrate subtrees to keep heat distribution similar
- Cheap because all metadata is in RADOS
- Maintains locality
25
The Ceph MDS – Persistence
All metadata is written to RADOS
- And changes are only visible once in RADOS
26
The Ceph MDS – Clustering Benefits
Dynamic adjustment to metadata workloads
- Replicate hot data to distribute workload
Dynamic cluster sizing:
- Add nodes as you wish
- Decommission old nodes at any time
Recover quickly and easily from failures
27
28
29
30
31
32
DYNAMIC SUBTREE PARTITIONING
33
Does it work?
It scales!
34
It redistributes!
35
Cool Extras
Besides POSIX-compliance and scaling
36
Snapshots
$ mkdir foo/.snap/one          # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776             # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one          # remove snapshot
37
Recursive statistics
$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
$ getfattr -d -m ceph. pomceph
# file: pomceph
ceph.dir.entries="39"
ceph.dir.files="37"
ceph.dir.rbytes="10550153946827"
ceph.dir.rctime="1298565125.590930000"
ceph.dir.rentries="2454401"
ceph.dir.rfiles="1585288"
ceph.dir.rsubdirs="869113"
ceph.dir.subdirs="2"
38
Different Storage strategies
- Set a “virtual xattr” on a directory and all new files underneath it follow that layout.
- Layouts can specify lots of detail about storage:
- pool file data goes into
- how large file objects and stripes are
- how many objects are in a stripe set
- So in one cluster you can use
- one slow pool with big objects for Hadoop workloads
- one fast pool with little objects for a scratch space
- one slow pool with small objects for home directories
- or whatever else makes sense...
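As a sketch of what setting such a layout looks like, the helper below builds `setfattr` command lines using the `ceph.dir.layout.*` virtual-xattr naming convention; treat the exact attribute names, pool names, and sizes as illustrative assumptions rather than a definitive interface.

```python
# Sketch: building the setfattr commands that would set a directory layout
# via CephFS "virtual xattrs".  Attribute names follow the ceph.dir.layout.*
# convention; the paths and pool names are made up.
def layout_commands(path, pool, object_size, stripe_unit, stripe_count):
    settings = {
        "ceph.dir.layout.pool": pool,                    # pool file data goes into
        "ceph.dir.layout.object_size": str(object_size), # how large file objects are
        "ceph.dir.layout.stripe_unit": str(stripe_unit), # size of each stripe unit
        "ceph.dir.layout.stripe_count": str(stripe_count),  # objects per stripe set
    }
    return [f"setfattr -n {k} -v {v} {path}" for k, v in sorted(settings.items())]

# e.g. a slow pool with big objects for Hadoop workloads:
cmds = layout_commands("/mnt/cephfs/hadoop", "slow-big", 64 * 2**20, 64 * 2**20, 1)
```

New files created under the directory would then inherit this layout, which is how one cluster serves the mixed workloads listed above.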
39
CephFS
Important Data structures
40
Directory objects
- One (or more!) per directory
- Deterministically named: <inode number>.<directory piece>
- Embeds dentries and inodes for each child of the folder
- Contains a potentially-stale versioned backtrace (path location)
- Located in the metadata pool
41
File objects
- One or more per file
- Deterministically named <ino number>.<object number>
- First object contains a potentially-stale versioned backtrace
- Located in any of the data pools
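Because the naming on the last two slides is deterministic, mapping a file offset to its backing object is just arithmetic. A sketch, assuming the common `<inode hex>.<8-digit hex object index>` format and a 4 MB default object size (both are conventions to treat as assumptions here):

```python
# Sketch of deterministic CephFS data-object naming: given an inode number
# and a file offset, compute the RADOS object that holds the data.
def data_object_name(ino: int, object_index: int) -> str:
    # e.g. inode 0x1234, first object -> "1234.00000000"
    return f"{ino:x}.{object_index:08x}"

def object_for_offset(ino: int, offset: int, object_size: int = 4 * 2**20) -> str:
    # Offsets map to objects by simple division; no lookup table needed.
    return data_object_name(ino, offset // object_size)

name = object_for_offset(0x10000000000, 9 * 2**20)  # falls in the third 4 MB object
```

This is exactly why backwards repair is possible at all: any raw object name can be parsed back into (inode, index) with no metadata lookup.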
42
MDS log (objects)
The MDS fully journals all metadata operations. The log is chunked across objects.
- Deterministically named <log inode number>.<log piece>
- Log objects may or may not be replayable if previous entries are lost
- each entry contains what it needs, but e.g. a file move can depend on a previous rename entry
- Located in the metadata pool
43
MDSTable objects
- Single objects
- SessionMap (per-MDS)
- stores the state of each client Session
- particularly: preallocated inodes for each client
- InoTable (per-MDS)
- Tracks which inodes are available to allocate
- (this is not a traditional inode mapping table or similar)
- SnapTable (shared)
- Tracks system snapshot IDs and their state (in use, pending create/delete)
- All located in the metadata pool
44
CephFS
Metadata update flow
45
Client Sends Request
46
Client sends the create-dir request to the MDS; the OSDs hold log objects (log.1..log.3) and directory objects (dir.1..dir.3)
MDS Processes Request
“Early Reply” and journaling
47
The MDS sends the client an early reply while the journal write is still in flight to the OSDs
MDS Processes Request
Journaling and safe reply
48
Once the OSDs ack the journal write (a new log.4 object), the MDS sends the client a safe reply
…time passes…
49
Further log segments (log.4..log.8) accumulate on the OSDs while the directory objects stay unchanged
MDS Flushes Log
50
The MDS writes the journaled updates out to the directory objects
MDS Flushes Log
51
The OSDs ack the directory write (dir.4)
MDS Flushes Log
52
With the directory write durable, the MDS deletes the now-unneeded log segments
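The flow in the preceding slides can be condensed into a toy model: early reply on receipt, safe reply once the journal write is acked, and directory writeback plus log trimming later. All class and field names here are invented for illustration.

```python
# Toy model of the CephFS metadata update flow shown above.
class ToyMDS:
    def __init__(self):
        self.journal = []    # simulated log segments in RADOS
        self.dirs = {}       # simulated backing directory objects
        self.replies = []    # replies sent to the client, in order

    def handle_create(self, dirname, dentry):
        self.replies.append(("early", dentry))   # early reply: client proceeds
        self.journal.append((dirname, dentry))   # journal write to RADOS...
        self.replies.append(("safe", dentry))    # ...acked, so reply "safe"

    def flush_log(self):
        # Later, apply journaled updates to the directory objects, then
        # delete the corresponding log segments.
        for dirname, dentry in self.journal:
            self.dirs.setdefault(dirname, set()).add(dentry)
        self.journal.clear()

mds = ToyMDS()
mds.handle_create("dir1", "newfile")
mds.flush_log()
```

The point of the two-phase reply is latency: the client unblocks at the early reply, but the change only survives MDS failure once journaled.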
Traditional fsck
53
e2fsck
Drawn from “ffsck: The Fast File System Checker”, FAST ’13, and http://mm.iit.uni-miskolc.hu/Data/texts/Linux/SAG/node92.html
54
Key data structures & checks
- Superblock
- check free block count, free inode count
- Data bitmap: blocks marked free are not in use
- Inode bitmap: inodes marked free are not in use
- Directories
- inodes are allocated, reasonable, and in-tree
- inodes
- consistent internal state
- link counts
- blocks claimed are valid and unique
55
Procedure
- Pass 1: iterate over all inodes
- check self-consistency
- build up maps of in-use blocks/inodes/etc.
- correct any issues with doubly-allocated blocks
- Pass 2: iterate over all directories
- check dentry validity and that all referenced inodes are valid
- cache a tree structure
- Pass 3: check directory connectivity in-memory
- Pass 4: check inode reference counts in-memory
- Pass 5: check cached maps against on-disk maps and overwrite if needed
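The pass structure above can be sketched on a toy filesystem image. The data model (inodes claim block lists, directories hold dentries) is invented for illustration, and passes 3-4 are omitted for brevity.

```python
# Toy sketch of e2fsck's pass structure on an invented filesystem image.
def fsck(inodes, dirs, disk_block_map):
    in_use_blocks, referenced = set(), set()
    # Pass 1: per-inode checks; detect doubly-allocated blocks.
    for ino, blocks in inodes.items():
        for b in blocks:
            if b in in_use_blocks:
                raise ValueError(f"block {b} doubly allocated")
            in_use_blocks.add(b)
    # Pass 2: every dentry must point at an allocated inode.
    for entries in dirs.values():
        for name, ino in entries:
            if ino not in inodes:
                raise ValueError(f"dangling dentry {name} -> {ino}")
            referenced.add(ino)
    # Pass 5: compare the computed block map with the on-disk one,
    # overwriting stale entries.
    repaired = dict(disk_block_map)
    for b in in_use_blocks:
        repaired[b] = True
    return referenced, repaired

inodes = {1: [10, 11], 2: [12]}            # inode -> blocks it claims
dirs = {1: [("file_a", 2)]}                # dir inode -> dentries
referenced, block_map = fsck(inodes, dirs, {10: True, 11: False, 12: False})
```

Note the contrast set up by the next section: e2fsck reasons about blocks and bitmaps, while CephFS fsck gets to reason about self-describing named objects.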
56
CephFS fsck
What it needs to do
57
RADOS is different from a disk
- We look at objects, not disk blocks
- And we can’t lose them (at least, not in a way we can recover from)
- We can deterministically identify all pieces of file data
- and the inode they belong to!
- It is not feasible to keep all metadata in-memory at once
- Data loss is the result of:
- bugs in the system,
- simultaneous catastrophic failure of RADOS (probably losing lots of random data),
- or bitrot
58
Failure detection: “forward scrub”
- Intended to find tree inconsistencies from bugs or bitrot
- Runs continuously in the background
- Traverse the tree and make sure referents agree in both directions
59
Catastrophic failure repair: backwards repair
- Repair the tree after a catastrophic failure, or when scrub detects an issue
- Run only with administrator intervention
- Could affordably be offline-only
60
Forward Scrub
Being implemented
61
Goal: Check the tree
In the background, examine the entire filesystem hierarchy and make sure it’s self-consistent.
62
File objects
- Every inode we find has the correct data objects
- The inode location and the data object backtrace are self-consistent
- slightly tricky, since backtraces can be stale
- (Optionally) The file size is consistent with what objects exist
63
Directory Objects
- The directory objects are self-consistent with their summaries and actual contents (the rstats)
- The directory’s backtrace and actual parent directory are self-consistent
- This covers any bugs causing double-links
64
Constraints
- Limited memory: we can’t hold all the metadata in-memory at once
- Scalable: we have an MDS cluster and this needs to work within that framework
65
On-disk data structure changes
- inode_t gets a scrub_stamp (time) and scrub_version to keep track of the last time it was scrubbed
- frag_t gets a scrub_stamp and scrub_version for both “local” and “recursive” scrubs
- Directories can be “fragmented” into multiple pieces; frag_t holds the metadata about a given directory fragment
- These values can be reported to the user via our rstats “virtual xattr” interface
66
Algorithm Design
- Obviously we’re doing a depth-first search:
- This means in the worst case we restrict memory usage to O(tree depth)
- Validating files first means that when we validate directory rstats, they’re not about to get changed
- Basic strategy: construct a stack of “CDentry”s to scrub, and when you scrub a directory, push its contents onto the top of the stack first
- But we want to limit it so we don’t explode memory usage
67
ScrubStack
- Examine the dentry
- If it’s a file: scrub it directly and record scrub stamp/version
- If it’s a directory:
- on first access, generate list of all dentries, segregated by directory/file status
- this lets us kick out the CDentry, CInode, CDir objects we had to read in to get the list
- on first access, note the current version and time
- push the next unscrubbed directory fragment on to the stack and restart
- If there are no frags left, push files onto the stack and restart
- When all files are scrubbed, scrub directory frags and record directory scrubbed
as of the start stamp/version values
- If we hit a tree owned by another MDS, spin off a request to scrub that tree and move on to another area of our own hierarchy until it’s completed
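A much-simplified single-MDS sketch of this traversal: an explicit stack, with children scrubbed before their containing directory so the rstats are stable when checked. The dict-of-lists tree is invented for illustration; fragment handling, the multi-MDS case, and memory limits are omitted.

```python
# Simplified sketch of the ScrubStack traversal described above.
# tree: dir name -> list of child names; anything not a key is a file.
def scrub(tree, root):
    scrubbed = []               # order in which entries were validated
    stack = [(root, False)]     # (name, children_already_pushed)
    while stack:
        name, expanded = stack.pop()
        children = tree.get(name)
        if children is None:
            scrubbed.append(name)        # a file: scrub it directly
        elif expanded:
            scrubbed.append(name)        # all children done: scrub the dir
        else:
            stack.append((name, True))   # revisit the dir after its children
            subdirs = [c for c in children if c in tree]
            files = [c for c in children if c not in tree]
            # Push files beneath subdirs so subdirectories pop first,
            # then files, then finally the directory itself.
            for f in files:
                stack.append((f, False))
            for d in subdirs:
                stack.append((d, False))
    return scrubbed

tree = {"/": ["a", "f1"], "a": ["f2"]}
order = scrub(tree, "/")
```

Memory stays proportional to the stack, not the whole tree, which is the property the constraints slide demands.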
68
How’s it scale?
Well, each MDS scrubs the data for which it’s authoritative. There’s only interaction at the boundaries, and there’s a lot of machinery around that which makes it fairly easy.
69
Backwards scrub/repair
Being designed
70
Goal: repair the tree
Once we know that there’s been some failure (either due to catastrophic data loss or a serious bug), examine the raw RADOS state and get all the data back into the filesystem hierarchy. Make all referents consistent with each other.
71
File objects
- Given an object name and location, we know if it’s a file object and which inode it belongs to
- The first file object for an inode has a backtrace of its location
72
Directory Objects
- Given an object name and location, we know if it’s a directory object, and for which directory inode
- The directory has a backtrace on it
- The directory has (versioned) forward links to all children
73
MDS Logs
The MDS logs can contain much newer versions of inodes than any of the backing objects
- e.g., a file got renamed into a directory on a different MDS and it hasn’t been flushed yet
74
Constraints
- Limited memory: we can’t hold all the metadata in-memory at once
- Scalable: we have an MDS cluster and this needs to work within that framework
75
Algorithm Design
Well, we haven’t finished this yet…but we have lots of ideas!
76
Building blocks
- Forward scrub: lets us identify things which we know are in the filesystem (potentially: tag them while doing so), and find things we know are missing but shouldn’t be
- Backtraces: give us (possibly stale) snapshots of where in the tree each directory or file lives
- importantly, these are versioned! So we can use the backtrace from one file to update the older backtrace from another
- RADOS object listing: we can list every object in the filesystem
- Even better: we can inject “filters” to restrict the listing to the first file object, or to objects which we have not tagged in a previous pass
- MDS logs: each log segment contains a lot of info about the objects it changes
- RADOS operations:
- snapshots could be useful
- you can store objects “next to” other objects: create a scratch space!
77
Check what we know we have
- Flush the MDS logs out to disk
- We can probably identify whether logs are whole: they’re numbered consecutively, and we can check objects on the boundary against any lost PGs
- Missing transactions are likely to be reconstructable from context: if the path for an inode changes unexpectedly, we can adjust to that!
- Cross-MDS transactions are either renames across authoritative zones (can be resolved by looking at the journals together) or transfers of authority (we can make up new boundaries)
- Have each MDS scrub its auth data, and tag all reached objects
- If we expect to find objects and don’t, or they are broken, add them to a fix list
78
Look at the raw objects
- Run a (filtered?) listing of all objects in the metadata and data pools
- We can filter out stuff that was tagged if we want; and depending on thoroughness we can skip objects without a backtrace
- This can scale in several ways (e.g. “map” along PG lines, “reduce” based on guessed authority from backtraces)
- For each found object we don’t have in the hierarchy, attempt to place it in the tree based on backtraces
- is it so new that the dentry never got flushed out (or was lost in a busted journal)?
- does it belong to a missing directory object?
- Create “phantom” directories based on backtrace contents, if needed
- If prior guessed data conflicts with new data, take the one with the newer version
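The “newer version wins” rule for backtraces can be sketched as follows; the backtrace representation here (a version number plus a leaf-to-root ancestry list) is an assumption made for illustration, not the on-disk encoding.

```python
# Sketch of reconciling versioned backtraces when placing orphaned objects.
# A backtrace is modeled as (version, [(ancestor_ino, dentry_name), ...])
# with the ancestry stored leaf-to-root.
def better_backtrace(a, b):
    """Return whichever (version, ancestry) backtrace is newer."""
    return a if a[0] >= b[0] else b

def path_from_backtrace(bt):
    # Reverse the leaf-to-root ancestry to rebuild a root-to-leaf path.
    return "/" + "/".join(name for _ino, name in reversed(bt[1]))

# A file that was renamed: the old backtrace says /olddir/myfile, the
# newer one says /newdir/myfile.  Repair trusts the higher version.
stale = (5, [(102, "myfile"), (101, "olddir")])
fresh = (9, [(102, "myfile"), (103, "newdir")])
chosen = better_backtrace(stale, fresh)
path = path_from_backtrace(chosen)
```

The same comparison lets one file’s fresh backtrace correct a sibling’s stale one, as the building-blocks slide notes.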
79
Try again!
- After doing both a hierarchy scan and a deep scan, and fixing things up, we should be able to touch everything in the system. Run it again to make sure.
80
How do we store repaired data?
We have a few options here:
- use RADOS snapshots to avoid changing the original data at all, and write updates in place
- use the “locator” functionality to explicitly create our own in-flight repair objects next to the original data until we can check them and flush back to the original data
- Or something more complicated? (object class code, explicitly copying all dentries within a directory object, etc.)
81
How would repair scale?
- Scrubbing scales across MDS authority zones as before
- Note that if necessary, we can spin up new MDSes with new authority zones to scale out the checking
- Each MDS can try to repair data for which it is authoritative, and pass along objects it finds to belong elsewhere
- and request stuff it thinks it should own from its peers, too
- Obviously a mapping from the raw RADOS listing to these authoritative zones is required:
- partition the PG space up evenly between MDSes, and let each one handle the listing
- chop up the raw data into expected authority zones based on backtraces (probably log them to an area in RADOS)
- ship off lists to the proper authoritative MDS
82
Questions?
- #ceph-devel on irc.oftc.net
- I’m gregsfortytwo
- ceph-devel@vger.kernel.org
- Greg Farnum
- gfarnum@redhat.com
- @gregsfortytwo
83