CephFS fsck: Distributed Filesystem Checking
Greg Farnum, CephFS Tech Lead, Red Hat


SLIDE 1

SLIDE 2

CephFS fsck: Distributed Filesystem Checking

SLIDE 3

Hi, I’m Greg

Greg Farnum
CephFS Tech Lead, Red Hat
gfarnum@redhat.com
Been working as a core Ceph developer since June 2009

SLIDE 4

SLIDE 5

What is Ceph?

An awesome, software-based, scalable, distributed storage system that is designed for failures

  • Object storage (our native API)
  • Block devices (Linux kernel, QEMU/KVM, others)
  • RESTful S3 & Swift API object store
  • POSIX Filesystem
SLIDE 6

[Diagram: the Ceph stack; apps, hosts/VMs, and clients sit on top of these components]

RADOS (Reliable Autonomic Distributed Object Store): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift

SLIDE 7

What is CephFS?

An awesome, software-based, scalable, distributed POSIX-compliant file system that is designed for failures

SLIDE 8

RADOS

A user perspective

SLIDE 9

Objects in RADOS

[Diagram: a RADOS object holds byte data, xattrs, a version (e.g. version: 1), and a key/value map (e.g. foo -> bar, baz -> qux)]

SLIDE 10

The librados API

C, C++, Python, Java, shell. File-like API (a Python sketch follows the list):

  • read/write (extent), truncate, remove; get/set/remove xattr or key
  • efficient copy-on-write clone
  • Snapshots: single object or pool-wide
  • atomic compound operations/transactions:
  • read + getxattr, write + setxattr
  • compare xattr value, if match write + setxattr
  • “object classes”:
  • load new code into the cluster to implement new methods
  • calc sha1, grep/filter, generate thumbnail
  • encrypt, increment, rotate image
  • implement your own access mechanisms (HDF5 on the node)
  • watch/notify: use an object as a communication channel between clients (locking primitive)
  • pgls: list the objects within a placement group
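
As a concrete illustration, here is a minimal sketch of driving this API from the Python bindings (python-rados); the pool name 'data' is a placeholder, and it assumes a reachable cluster and keyring:

    import rados

    # Connect using the local Ceph config (assumes a running cluster).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    ioctx = cluster.open_ioctx('data')              # 'data' is a hypothetical pool
    try:
        ioctx.write_full('greeting', b'hello')      # create/overwrite an object
        ioctx.set_xattr('greeting', 'lang', b'en')  # attach an xattr
        print(ioctx.read('greeting'))               # read the data back
        ioctx.remove_object('greeting')             # remove it again
    finally:
        ioctx.close()
        cluster.shutdown()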

SLIDE 11

The RADOS Cluster

[Diagram: a CLIENT talks to a cluster of Object Storage Devices (OSDs) and three monitors (M)]

SLIDE 12

[Diagram: objects are grouped into pools (Pool 1, Pool 2) and stored across the Object Storage Devices (OSDs); a Monitor (M) tracks cluster state]

SLIDE 13

[Diagram: object placement happens in two steps]

PG = hash(object name) % num_pgs
OSDs = CRUSH(PG, cluster state, rule set)
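
A toy rendering of those two steps (Ceph's real object hash and CRUSH itself are far more involved; both functions below are stand-ins):

    import zlib

    def object_to_pg(object_name: str, pg_num: int) -> int:
        # Step 1: a stable hash of the object name, folded into the pool's PG count.
        # (zlib.crc32 stands in for Ceph's real string hash.)
        return zlib.crc32(object_name.encode()) % pg_num

    def pg_to_osds(pg: int, osds: list[int], replicas: int = 3) -> list[int]:
        # Step 2: CRUSH maps a PG plus the cluster map and rule set to OSDs.
        # This stand-in just picks deterministic, distinct OSDs for the PG.
        return [osds[(pg + i) % len(osds)] for i in range(replicas)]

    print(pg_to_osds(object_to_pg('10000000001.00000000', 64), osds=list(range(10))))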

SLIDE 14

RADOS data guarantees

  • Any write acked as safe will be visible to all subsequent readers
  • Any write ever visible to a reader will be visible to all subsequent readers
  • Any write acked as safe will not be lost unless the whole containing PG is lost
  • A PG will not be lost unless all N copies are lost (N is admin-configured, usually 3)…
  • …and in case of OSD failure the system will try to bring you back up to N copies (no user intervention required)

SLIDE 15

RADOS data guarantees

  • Data is regularly scrubbed to ensure copies are consistent with each other, and administrators are alerted if inconsistencies arise
  • …and while it’s not automated, it’s usually easy to identify the correct data with “majority voting” or similar
  • btrfs maintains checksums for certainty, and we think this is the future

SLIDE 16

CephFS

System Design

SLIDE 17

CephFS Design Goals

  • Infinitely scalable
  • Avoid all Single Points Of Failure
  • Self-managing

SLIDE 18

[Diagram: the CLIENT talks to a Metadata Server (MDS) for metadata, and to the OSDs/monitors for data]

SLIDE 19

Scaling Metadata

So we have to use multiple MetaData Servers (MDSes). Two issues:

  • Storage of the metadata
  • Ownership of the metadata

SLIDE 20

Scaling Metadata – Storage

Some systems store metadata on the MDS system itself. But that’s a Single Point Of Failure!

  • Hot standby?
  • External metadata storage √

SLIDE 21

Scaling Metadata – Ownership

Traditionally: assign hierarchies manually to each MDS

  • But if workloads change, your nodes can become unbalanced

Newer: hash directories onto MDSes

  • But then clients have to jump around for every folder traversal

SLIDE 22

[Diagram: one tree, two metadata servers]

SLIDE 23

[Diagram: one tree, two metadata servers]

SLIDE 24

The Ceph Metadata Server

Key insight: if metadata is stored in RADOS, ownership should be impermanent. One MDS is authoritative over any given subtree, but...

  • That MDS doesn’t need to keep the whole tree in-memory
  • There’s no reason the authoritative MDS can’t be changed!

SLIDE 25

The Ceph MDS – Partitioning

Cooperative Partitioning between servers:

  • Keep track of how hot metadata is
  • Migrate subtrees to keep heat distribution similar
  • Cheap because all metadata is in RADOS
  • Maintains locality

SLIDE 26

The Ceph MDS – Persistence

All metadata is written to RADOS

  • And changes are only visible once they’re in RADOS

SLIDE 27

The Ceph MDS – Clustering Benefits

Dynamic adjustment to metadata workloads

  • Replicate hot data to distribute workload

Dynamic cluster sizing:

  • Add nodes as you wish
  • Decommission old nodes at any time

Recover quickly and easily from failures

SLIDE 28

SLIDE 29

SLIDE 30

SLIDE 31

SLIDE 32

DYNAMIC SUBTREE PARTITIONING

SLIDE 33

Does it work?

SLIDE 34

It scales!

SLIDE 35

It redistributes!

SLIDE 36

Cool Extras

Besides POSIX-compliance and scaling

SLIDE 37

Snapshots

    $ mkdir foo/.snap/one      # create snapshot
    $ ls foo/.snap
    one
    $ ls foo/bar/.snap
    _one_1099511627776         # parent's snap name is mangled
    $ rm foo/myfile
    $ ls -F foo
    bar/
    $ ls foo/.snap/one
    myfile bar/
    $ rmdir foo/.snap/one      # remove snapshot

SLIDE 38

Recursive statistics

    $ ls -alSh | head
    total 0
    drwxr-xr-x 1 root root 9.7T 2011-02-04 15:51 .
    drwxr-xr-x 1 root root 9.7T 2010-12-16 15:06 ..
    drwxr-xr-x 1 pomceph pg4194980 9.6T 2011-02-24 08:25 pomceph
    drwx--x--- 1 fuzyceph adm 1.5G 2011-01-18 10:46 fuzyceph
    drwxr-xr-x 1 dallasceph pg275 596M 2011-01-14 10:06 dallasceph
    $ getfattr -d -m ceph. pomceph
    # file: pomceph
    ceph.dir.entries="39"
    ceph.dir.files="37"
    ceph.dir.rbytes="10550153946827"
    ceph.dir.rctime="1298565125.590930000"
    ceph.dir.rentries="2454401"
    ceph.dir.rfiles="1585288"
    ceph.dir.rsubdirs="869113"
    ceph.dir.subdirs="2"

SLIDE 39

Different Storage strategies

  • Set a “virtual xattr” on a directory and all new files underneath it follow that layout (a sketch follows this list)
  • Layouts can specify lots of detail about storage:
  • the pool file data goes into
  • how large file objects and stripes are
  • how many objects are in a stripe set
  • So in one cluster you can use:
  • one slow pool with big objects for Hadoop workloads
  • one fast pool with little objects for a scratch space
  • one slow pool with small objects for home directories
  • or whatever else makes sense...
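
One plausible way to set such a layout from Python via the virtual-xattr interface; the directory and pool names are hypothetical, and the exact ceph.dir.layout.* keys should be checked against your Ceph version:

    import os

    # Pin all new files under ./hadoop-data to a hypothetical 'slow-big' pool,
    # with large objects, via CephFS's ceph.dir.layout virtual xattrs.
    os.setxattr('hadoop-data', 'ceph.dir.layout.pool', b'slow-big')
    os.setxattr('hadoop-data', 'ceph.dir.layout.object_size', b'67108864')  # 64 MiB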

SLIDE 40

CephFS

Important Data structures

SLIDE 41

Directory objects

  • One (or more!) per directory
  • Deterministically named: <inode number>.<directory piece>
  • Embeds dentries and inodes for each child of the folder
  • Contains a potentially-stale versioned backtrace (path location)
  • Located in the metadata pool

SLIDE 42

File objects

  • One or more per file
  • Deterministically named <ino number>.<object number> (see the sketch below)
  • The first object contains a potentially-stale versioned backtrace
  • Located in any of the data pools
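
A hypothetical rendering of those deterministic names; CephFS formats them in hex, but the exact field widths here are an assumption:

    def file_object_name(ino: int, obj_index: int) -> str:
        # Nth data object of a file, e.g. "10000000001.00000003".
        return f"{ino:x}.{obj_index:08x}"

    def dir_object_name(ino: int, frag: int) -> str:
        # A directory fragment's object in the metadata pool.
        return f"{ino:x}.{frag:08x}"

    print(file_object_name(0x10000000001, 3))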

SLIDE 43

MDS log (objects)

The MDS fully journals all metadata operations. The log is chunked across objects.

  • Deterministically named <log inode number>.<log piece>
  • Log objects may or may not be replayable if previous entries are lost
  • each entry contains what it needs, but e.g. a file move can depend on a previous rename entry
  • Located in the metadata pool

SLIDE 44

MDSTable objects

  • Single objects
  • SessionMap (per-MDS)
  • stores the state of each client Session
  • particularly: preallocated inodes for each client
  • InoTable (per-MDS)
  • tracks which inodes are available to allocate
  • (this is not a traditional inode-mapping table or similar)
  • SnapTable (shared)
  • tracks system snapshot IDs and their state (in use, pending create/delete)
  • All located in the metadata pool

SLIDE 45

CephFS

Metadata update flow

SLIDE 46

Client Sends Request

[Diagram: the CLIENT sends a “create dir” request to the MDS; the OSDs hold log objects (log.1-log.3) and directory objects (dir.1-dir.3)]

SLIDE 47

MDS Processes Request

“Early Reply” and journaling

[Diagram: the MDS sends the CLIENT an early (unsafe) reply while writing a journal entry to the OSDs]

SLIDE 48

MDS Processes Request

Journaling and safe reply

[Diagram: once the OSDs ack the journal write (now log.4), the MDS sends the CLIENT a safe reply]

SLIDE 49

…time passes…

[Diagram: the journal has grown to log.4-log.8; the directory objects are still at their old versions]

SLIDE 50

MDS Flushes Log

[Diagram: the MDS writes the updated directory contents back to the directory objects on the OSDs]

SLIDE 51

MDS Flushes Log

[Diagram: the OSDs ack the directory write (now dir.4)]

SLIDE 52

MDS Flushes Log

[Diagram: the MDS deletes the log objects that the flushed directory state now covers]
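
Putting the whole flow together, a minimal sketch (all helper names are hypothetical; the real logic lives in the C++ MDS):

    # Sketch of the metadata-update flow shown above.
    def handle_client_request(mds, request):
        mds.apply_in_memory(request)           # update the in-RAM metadata
        mds.send_early_reply(request.client)   # unsafe reply: visible, not yet durable
        entry = mds.journal.append(request)    # journal write to a log.N object in RADOS
        entry.wait_for_ack()                   # RADOS acks the write as safe
        mds.send_safe_reply(request.client)    # the update now survives MDS failure

    def flush_log(mds):
        # Later, the journal is flushed: directory objects get their final
        # contents written back, and the covered log objects are deleted.
        for segment in mds.journal.expired_segments():
            mds.write_back_directories(segment)
            mds.journal.trim(segment)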

SLIDE 53

Traditional fsck

SLIDE 54

e2fsck

Drawn from “ffsck: The Fast File System Checker” (FAST ’13) and http://mm.iit.uni-miskolc.hu/Data/texts/Linux/SAG/node92.html

SLIDE 55

Key data structures & checks

  • Superblock
  • check free block count, free inode count
  • Data bitmap: blocks marked free are not in use
  • Inode bitmap: inodes marked free are not in use
  • Directories
  • inodes are allocated, reasonable, and in-tree
  • inodes
  • consistent internal state
  • link counts
  • blocks claimed are valid and unique

SLIDE 56

Procedure

  • Pass 1: iterate over all inodes
  • check self-consistency
  • build up maps of in-use blocks/inodes/etc.
  • correct any issues with doubly-allocated blocks
  • Pass 2: iterate over all directories
  • check dentry validity and that all referenced inodes are valid
  • cache a tree structure
  • Pass 3: check directory connectivity in-memory
  • Pass 4: check inode reference counts in-memory
  • Pass 5: check cached maps against on-disk maps and overwrite if needed

(A compressed sketch of these passes follows.)
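
A compressed sketch of those passes (hypothetical helpers, not real e2fsck code):

    def fsck(fs):
        used_blocks, link_counts = {}, {}

        for inode in fs.all_inodes():               # Pass 1: inode self-consistency
            inode.check_internal_state()
            for block in inode.blocks():
                if block in used_blocks:
                    fs.fix_doubly_allocated(block)  # resolve the duplicate claim
                used_blocks[block] = inode

        for directory in fs.all_directories():      # Pass 2: dentry validity
            for dentry in directory.entries():
                if not fs.inode_is_valid(dentry.ino):
                    fs.remove_dentry(dentry)
                link_counts[dentry.ino] = link_counts.get(dentry.ino, 0) + 1

        fs.check_connectivity()                     # Pass 3: every directory reachable
        fs.compare_link_counts(link_counts)         # Pass 4: inode reference counts
        fs.rewrite_bitmaps(used_blocks)             # Pass 5: cached maps vs. on-disk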

SLIDE 57

CephFS fsck

What it needs to do

SLIDE 58

RADOS is different from a disk

  • We look at objects, not disk blocks
  • And we can’t lose them (at least not in a way we can recover)
  • We can deterministically identify all pieces of file data
  • and the inode they belong to!
  • It is not feasible to keep all metadata in-memory at once
  • Data loss is the result of:
  • bugs in the system,
  • simultaneous catastrophic failure of RADOS (probably losing lots of random data),
  • or bitrot

SLIDE 59

Failure detection: “forward scrub”

  • Intended to find tree inconsistencies from bugs or bitrot
  • Runs continuously in the background
  • Traverse the tree and make sure referents agree in both directions

SLIDE 60

Catastrophic failure repair: backwards repair

  • Repair the tree after catastrophic failure, or after scrub detects an issue
  • Run only with administrator intervention
  • Could affordably be offline-only

SLIDE 61

Forward Scrub

Being implemented

SLIDE 62

Goal: Check the tree

In the background, examine the entire filesystem hierarchy and make sure it’s self-consistent.

SLIDE 63

File objects

  • Every inode we find has the correct data objects
  • The inode location and data object backtrace is self-consistent
  • slightly tricky, since they can be stale
  • (Optionally) The file size is consistent with what objects exist

SLIDE 64

Directory Objects

  • The directory objects are self-consistent with their summaries and actual contents (the rstats)
  • The directory’s backtrace and actual parent directory are self-consistent
  • This covers any bugs causing double-links

SLIDE 65

Constraints

  • Limited memory: we can’t hold all the metadata in-memory at once
  • Scalable: we have an MDS cluster and this needs to work within that framework

SLIDE 66

On-disk data structure changes

  • inode_t gets a scrub_stamp (time) and scrub_version to keep track of the last time it was scrubbed
  • frag_t gets a scrub_stamp and scrub_version for both “local” and “recursive” scrubs (a sketch follows this list)
  • Directories can be “fragmented” into multiple pieces; frag_t holds the metadata about a given directory fragment
  • These values can be reported to the user via our rstats “virtual xattr” interface

SLIDE 67

Algorithm Design

  • Obviously we’re doing a depth-first search:
  • this means in the worst case we restrict memory usage to O(tree depth)
  • validating files first means that when we validate directory rstats, they’re not about to get changed
  • Basic strategy: construct a stack of “CDentry”s to scrub, and when you scrub a directory, push its contents onto the top of the stack first
  • But we want to limit it so we don’t explode memory usage

SLIDE 68

ScrubStack

  • Examine the dentry (a sketch of this loop follows the list)
  • If it’s a file: scrub it directly and record scrub stamp/version
  • If it’s a directory:
  • on first access, generate a list of all dentries, segregated by directory/file status
  • this lets us kick out the CDentry, CInode, and CDir objects we had to read in to get the list
  • on first access, note the current version and time
  • push the next unscrubbed directory fragment onto the stack and restart
  • if there are no frags left, push files onto the stack and restart
  • when all files are scrubbed, scrub directory frags and record the directory scrubbed as of the start stamp/version values
  • If we hit a tree owned by another MDS, spin off a request to scrub that tree and move on to another area of our own hierarchy until it’s completed
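
A minimal sketch of that loop, where stack items are file dentries or directory fragments (helper names are hypothetical; the real ScrubStack operates on C++ CDentry/CDir objects):

    def scrub(stack):
        while stack:
            item = stack[-1]
            if item.is_file():
                validate_file(item)               # backtrace, size, data objects
                item.record_scrub()
                stack.pop()
            elif not item.is_auth():
                request_remote_scrub(item)        # another MDS owns this subtree
                stack.pop()                       # continue elsewhere meanwhile
            elif item.first_visit():
                item.note_start_version_and_stamp()
                item.list_children()              # split dirs/files, then drop the
                                                  # objects read in to build the list
            elif (frag := item.next_unscrubbed_frag()) is not None:
                stack.append(frag)                # depth-first: subdirectories first
            elif (f := item.next_unscrubbed_file()) is not None:
                stack.append(f)
            else:
                validate_dir(item)                # rstats vs. actual contents
                item.record_scrub_from_start()    # stamp/version noted at the start
                stack.pop()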

SLIDE 69

How’s it scale?

Well, each MDS scrubs the data for which it’s authoritative. There’s only interaction at the boundaries, and there’s a lot of machinery around that which makes it fairly easy.

SLIDE 70

Backwards scrub/repair

Being designed

SLIDE 71

Goal: repair the tree

Once we know that there’s been some failure (either due to catastrophic data loss or a serious bug), examine the raw RADOS state and get all the data back into the filesystem hierarchy. Make all referents consistent with each other.

SLIDE 72

File objects

  • Given an object name and location, we know if it’s a file object and which inode it belongs to
  • The first file object for an inode has a backtrace of its location

SLIDE 73

Directory Objects

  • Given an object name and location, we know if it’s a directory object, and for which directory inode
  • The directory has a backtrace on it
  • The directory has (versioned) forward links to all children

SLIDE 74

MDS Logs

The MDS logs can contain much newer versions of inodes than any of the backing objects

  • e.g., a file got renamed into a directory on a different MDS and it hasn’t been flushed yet

SLIDE 75

Constraints

  • Limited memory: we can’t hold all the metadata in-memory at once
  • Scalable: we have an MDS cluster and this needs to work within that framework

SLIDE 76

Algorithm Design

Well, we haven’t finished this yet…but we have lots of ideas!

SLIDE 77

Building blocks

  • Forward scrub: lets us identify things which we know are in the filesystem (potentially: tag them while doing so), and find things we know are missing but shouldn’t be
  • Backtraces: give us (possibly stale) snapshots of where in the tree each directory or file lives (see the sketch after this list)
  • importantly, these are versioned! So we can use the backtrace from one file to update the older backtrace from another
  • RADOS object listing: we can list every object in the filesystem
  • even better: we can inject “filters” to restrict the listing, to the first file object, or to objects which we have not tagged in a previous pass
  • MDS logs: each log segment contains a lot of info about the objects it changes
  • RADOS operations:
  • snapshots could be useful
  • you can store objects “next to” other objects: create a scratch space!
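
The backtrace shape those bullets rely on, sketched as a (hypothetical) Python structure:

    from dataclasses import dataclass

    @dataclass
    class Ancestor:
        dirino: int  # inode of an ancestor directory
        dname: str   # the name this entry has within that directory

    @dataclass
    class Backtrace:
        ino: int                   # inode the backtrace belongs to
        ancestors: list[Ancestor]  # ordered from immediate parent up to the root
        version: int               # newer backtraces supersede stale ones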

SLIDE 78

Check what we know we have

  • Flush the MDS logs out to disk
  • We can probably identify if logs are whole or not: they’re numbered consecutively, and we can check objects on the boundary against any lost PGs
  • Missing transactions are likely to be reconstructable from context: if the path for an inode changes unexpectedly, we can adjust to that!
  • Cross-MDS transactions are either renames across authoritative zones (can be resolved by looking at the journals together) or about transfers of authority (we can make up new boundaries)
  • Have each MDS scrub its auth data, and tag all reached objects
  • If we expect to find objects and don’t, or they are broken, add them to a fix list

SLIDE 79

Look at the raw objects

  • Run a (filtered?) listing of all objects in the metadata and data pools
  • we can filter out stuff that was tagged if we want; and depending on thoroughness we can skip objects without a backtrace
  • this can scale in several ways (e.g. “map” along PG lines, “reduce” based on guessed authority from backtraces)
  • For each found object we don’t have in the hierarchy, attempt to place it in the tree based on backtraces (a sketch follows this list)
  • is it so new that the dentry never got flushed out (or was lost in a busted journal)?
  • does it belong to a missing directory object?
  • Create “phantom” directories based on backtrace contents, if needed
  • If prior guessed data conflicts with new data, take the one with the newer version
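
A sketch of that placement step (hypothetical helpers; the actual design was still open when this talk was given):

    def backward_scan(pool, tree):
        # List raw objects, skipping those the forward scrub already tagged.
        for obj in pool.list_objects(filter='untagged'):
            ino, piece = parse_object_name(obj.name)  # "<hex ino>.<piece>"
            backtrace = obj.get_backtrace()           # possibly stale, but versioned
            if backtrace is None:
                continue                              # skippable, depending on thoroughness
            if not tree.contains(ino):
                parent = tree.lookup_or_create_phantom(backtrace.ancestors)
                parent.link(ino, backtrace)           # re-attach using the backtrace
            elif tree.backtrace_version(ino) < backtrace.version:
                tree.move(ino, backtrace)             # newer version wins conflicts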

SLIDE 80

Try again!

  • After doing both a hierarchy scan and a deep scan, and fixing things up, we should be able to touch everything in the system. Run it again to make sure.

SLIDE 81

How do we store repaired data?

We have a few options here:

  • use RADOS snapshots to not change the original data at all, and write updates in place
  • use the “locator” functionality to explicitly create our own in-flight repair objects next to the original data, until we can check them and flush back to the original objects
  • or something more complicated? (object class code, explicitly copying all dentries within a directory object, etc.)

SLIDE 82

How would repair scale?

  • Scrubbing scales across MDS authority zones as before
  • note that if necessary, we can spin up new MDSes with new authority zones to scale out the checking
  • Each MDS can try to repair data for which it is authoritative, and pass along objects it finds to belong elsewhere
  • and request stuff it thinks it should own from its peers, too
  • Obviously a mapping from the raw RADOS listing to these authoritative zones is required:
  • partition the PG space up evenly between MDSes, and let each one handle the listing
  • chop up the raw data into expected authority zones based on backtraces (probably log them to an area in RADOS)
  • ship off lists to the proper authoritative MDS

SLIDE 83

Questions?

  • #ceph-devel on irc.oftc.net
  • I’m gregsfortytwo
  • ceph-devel@vger.kernel.org
  • Greg Farnum
  • gfarnum@redhat.com
  • @gregsfortytwo
