State of the Art: Where we are with the ext3 filesystem

Ottawa Linux Symposium 2005 State of the Art: Where we are with the ext3 filesystem Mingming Cao, Theodore Y. Ts'o, Badari Pulavarty, Suparna Bhattacharya IBM Linux Technology Center Andreas Dilger, Alex Tomas Cluster Filesystem Inc.


slide-1
SLIDE 1

1

Ottawa Linux Symposium 2005

State of the Art: Where we are with the ext3 filesystem

Mingming Cao, Theodore Y. Ts'o, Badari Pulavarty, Suparna Bhattacharya IBM Linux Technology Center Andreas Dilger, Alex Tomas Cluster Filesystem Inc.

slide-2
SLIDE 2

2

Introduction

  • Why does the ext3 filesystem have a large community and many users?
  • Why is improving the current ext3 filesystem important?
  • The plan to “modernize” the ext3 filesystem.
  • What has happened to ext3 in the past three years?
slide-3
SLIDE 3

3

Activities Summary

Included features              Linux-2.4      Linux-2.6.11
Directory Indexing             patch/vendor   yes
Online Resizing                patch          yes
Reduce Locking Contention      no             yes
Block Reservation              no             yes
Extended Attributes            patch/vendor   yes

Under development              Linux-2.4      Linux-2.6
Extents                        patch          patch
Delayed Allocation             no             patch
Multiple Block Allocation      no             patch
Reduce File Removal Latency    patch          easy to port
Increased subdir support       patch          patch
Parallel directory operations  patch          patch
Finer Timestamp                no             patch

slide-4
SLIDE 4

4

Overview

  • Part A: New ext3 features added to Linux 2.6
  • Some ext3 features under development:
    – Part B: Features that imply a filesystem format change
    – Part C: Improvements within the current ext3 layout

  • Future work
slide-5
SLIDE 5

5

Part A: Features Added to Linux 2.6

  • Directory indexing

(By Daniel Phillips, Theodore Y. Ts'o, 2.5)

  • Online resizing

(By Andreas Dilger, Stephen Tweedie, 2.6.10)

  • Removing the BKL and improving scalability

(By Andrew Morton, Alex Tomas, 2.5)

  • Extended attributes

(By Andreas Gruenbacher, 2.5)

  • Reservation based block preallocation

(By Mingming Cao, Andrew Morton, Stephen Tweedie, Badari Pulavarty, 2.6.10)

slide-6
SLIDE 6

6

Directory Indexing

  • Scalability issues with old ext2/3 directories: entries are kept in a simple linked list.
  • A simplified tree structure (HTree) was designed for directories.
  • HTree features: 32-bit hashes for keys, high fanout factor, constant depth
  • Boosts ext3 performance on large directories.
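
To illustrate why hashed keys with a high fanout keep lookups cheap, here is a minimal sketch of picking a leaf block from a sorted array of hash ranges by binary search. The struct and function names are hypothetical and the in-memory array is an assumption; the real on-disk index format is not shown on the slide.

```c
#include <stddef.h>
#include <stdint.h>

/* One index entry: the lowest hash covered by a leaf block (hypothetical layout). */
struct dx_entry_sketch {
    uint32_t hash;   /* lower bound of the hash range */
    uint32_t block;  /* leaf block holding names in that range */
};

/*
 * Return the leaf block whose hash range contains `hash`.
 * `entries` must be sorted by hash, with entries[0].hash == 0 so that
 * every hash falls into some range. `count` must be at least 1.
 */
uint32_t dx_find_leaf(const struct dx_entry_sketch *entries, size_t count,
                      uint32_t hash)
{
    size_t lo = 0, hi = count;   /* the answer lies in [lo, hi) */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (entries[mid].hash <= hash)
            lo = mid;            /* mid is still a candidate */
        else
            hi = mid;            /* mid starts beyond the target */
    }
    return entries[lo].block;
}
```

With a few hundred entries per index block, each level of such an index narrows the search by the same factor, which is what keeps the tree shallow.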
slide-7
SLIDE 7

7

Online Resizing

  • Online resizing allows a filesystem to grow without downtime, a very useful feature in server environments.
  • Handles the three primary phases in which a filesystem can grow:
    – Grow within the last block group
    – Need a new block group
    – Need a new block group descriptor
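
The three growth phases can be told apart with simple arithmetic. The sketch below is an illustration only: it assumes one bitmap block per group (so 8 × block size blocks per group), 32-byte group descriptors, and that the third phase means a new descriptor block is needed; it ignores the first data block offset and any reserved descriptor blocks. All names are hypothetical.

```c
#include <stdint.h>

enum grow_phase {
    GROW_WITHIN_LAST_GROUP,
    GROW_NEW_GROUP,
    GROW_NEW_DESCRIPTOR_BLOCK
};

/* Number of block groups needed to cover `blocks` blocks (rounded up). */
static uint64_t group_count(uint64_t blocks, uint32_t block_size)
{
    uint64_t per_group = 8ULL * block_size;  /* one bitmap block, one bit per block */
    return (blocks + per_group - 1) / per_group;
}

/* Number of descriptor blocks needed for `groups` groups (rounded up). */
static uint64_t desc_block_count(uint64_t groups, uint32_t block_size)
{
    uint64_t per_block = block_size / 32;    /* assumed 32-byte descriptors */
    return (groups + per_block - 1) / per_block;
}

/* Classify a grow request from old_blocks to new_blocks. */
enum grow_phase classify_grow(uint64_t old_blocks, uint64_t new_blocks,
                              uint32_t block_size)
{
    uint64_t old_groups = group_count(old_blocks, block_size);
    uint64_t new_groups = group_count(new_blocks, block_size);

    if (new_groups == old_groups)
        return GROW_WITHIN_LAST_GROUP;
    if (desc_block_count(new_groups, block_size) ==
        desc_block_count(old_groups, block_size))
        return GROW_NEW_GROUP;
    return GROW_NEW_DESCRIPTOR_BLOCK;
}
```

With 4 KB blocks this gives 32768 blocks per group and 128 descriptors per descriptor block, so the third case first appears when the filesystem grows past 128 groups.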

slide-8
SLIDE 8

8

Reduce Lock Contention

  • Scaling issue for 2.4 ext3/JBD with concurrent IO
  • A series of efforts was made:
    – Ext3: replaced the per-filesystem superblock lock with finer-grained locks
    – Journalling layer: pushed the BKL out of JBD
  • SDET benchmark throughput improved by a factor of 10

slide-9
SLIDE 9

9

Extended Attributes

  • Extended attributes: small amounts of custom metadata attached to files or directories
  • Added to ext2/3 to support ACLs
  • EAs are stored in a single EA block, shareable by inodes that have the same extended attributes
  • Can be stored in the expanded inode itself (2.6.11+)
  • EA-in-inode noticeably speeds up ext3 performance on the Samba4 benchmark

slide-10
SLIDE 10

10

Reservation based block preallocation

  • Block preallocation helps reduce file fragmentation
  • Ext3 uses in-memory block reservation to support a large preallocation
  • Results in significant throughput improvements on concurrent sequential writes and the subsequent sequential reads

[Figure: on-disk block layout of file 1, file 2, file 3 and file 4 on ext3 before and after block reservation]

slide-11
SLIDE 11

11

Ext3 Block Reservation

  • Key difference: the reservation is kept in memory rather than on disk
  • Each inode has its own reservation window; windows cannot overlap and are indexed by a per-filesystem red-black tree
  • Allocation happens within the window. The window can dynamically move and grow
  • The window is discarded at the last file close
slide-12
SLIDE 12

12

Reservation Tree

[Figure: tree of reservation windows (0, 7), (8, 31), (32, 63) and (64, 71) laid over the disk blocks, one window for each of file 1 to file 4]
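
As a rough illustration of how a new, non-overlapping window might be placed among existing ones, the sketch below walks a sorted array of windows and returns the first gap that is large enough. A sorted array stands in for the per-filesystem red-black tree here, and all names are hypothetical; it does not consult the block bitmap or the end of the filesystem.

```c
#include <stddef.h>
#include <stdint.h>

/* An in-memory reservation window covering blocks [start, end], inclusive. */
struct rsv_window_sketch {
    uint64_t start;
    uint64_t end;
};

/*
 * Find the start of the first gap of at least `size` blocks at or after
 * `goal`. `w` must be sorted by start and its windows must not overlap.
 */
uint64_t rsv_find_gap(const struct rsv_window_sketch *w, size_t n,
                      uint64_t goal, uint64_t size)
{
    uint64_t candidate = goal;

    for (size_t i = 0; i < n; i++) {
        if (w[i].end < candidate)
            continue;                   /* window lies entirely before us */
        if (w[i].start >= candidate + size)
            break;                      /* the gap before this window fits */
        candidate = w[i].end + 1;       /* collision: retry just past it */
    }
    return candidate;
}
```

With the four windows listed for this slide, a request for 8 blocks starting from block 0 finds no gap until block 72, just past the last window.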

slide-13
SLIDE 13

13

tiobench sequential write

[Figure: throughput (MB/sec) at 4, 16 and 64 threads for ext3 2.4.29, ext3 2.6.11, JFS and XFS]

slide-14
SLIDE 14

14

Part B: Extents and Related Work

  • Extents
  • Extents Allocation
  • Delayed allocation for extents

(By Alex Tomas)

slide-15
SLIDE 15

15

Ext2/3 Indirect Block Map

[Figure: i_data in the inode holds direct block pointers plus pointers to an indirect block, a double indirect block and a triple indirect block, which in turn lead to the disk blocks]

slide-16
SLIDE 16

16

Extents

  • An extent is an efficient way to represent a large contiguous file
  • An extent is a single descriptor for a range of contiguous blocks

[Figure: extent descriptor with the fields logical (32 bit), length (16 bit) and physical (48 bit); the example values shown are 1000 and 200]

slide-17
SLIDE 17

17

Extents Data Structures

    struct ext3_extent {
        __u32 ee_block;     /* logical */
        __u16 ee_len;       /* length */
        __u16 ee_start_hi;
        __u32 ee_start;     /* physical */
    };

    struct ext3_extent_header {
        __u16 eh_magic;
        __u16 eh_entries;
        __u16 eh_max;
        __u16 eh_depth;
        __u32 eh_generation;
    };
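
A minimal sketch of how such a descriptor could be used to translate a logical block into a physical one. It mirrors the extent fields with fixed-width userspace types and assumes that ee_start_hi carries the upper 16 bits of the 48-bit physical block number; the helper names are hypothetical.

```c
#include <stdint.h>

/* Userspace mirror of the extent descriptor, using fixed-width types. */
struct extent_sketch {
    uint32_t ee_block;     /* first logical block covered */
    uint16_t ee_len;       /* number of blocks covered */
    uint16_t ee_start_hi;  /* assumed: high 16 bits of the physical block */
    uint32_t ee_start;     /* low 32 bits of the physical block */
};

/* Combine the two start fields into one 48-bit physical block number. */
uint64_t extent_start(const struct extent_sketch *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start;
}

/*
 * Map a logical block to its physical block.
 * Returns 1 and stores the result in *phys if the extent covers `logical`,
 * otherwise returns 0 and leaves *phys untouched.
 */
int extent_map(const struct extent_sketch *ex, uint32_t logical, uint64_t *phys)
{
    if (logical < ex->ee_block || logical - ex->ee_block >= ex->ee_len)
        return 0;
    *phys = extent_start(ex) + (logical - ex->ee_block);
    return 1;
}
```

For an illustrative extent that starts at logical block 0, covers 1000 blocks and begins at physical block 200, logical block 999 maps to physical block 1199, and logical block 1000 is outside the extent.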

slide-18
SLIDE 18

18

Extent Map

[Figure: i_data holds a header followed by extent entries (values shown: 1000, 200 and 1001, 2000, 6000) that point to runs of disk blocks 200 to 1199 and 6000 to 6199]

slide-19
SLIDE 19

19

Extents Tree

  • A simplified tree, like the HTree used in directories
  • Constant depth
  • Tree nodes include index nodes and leaf nodes
  • Each node starts with a header structure
  • A flag in the inode indicates the block addressing type: extent map or indirect block mapping

slide-20
SLIDE 20

20

Extent Tree

[Figure: the root in i_data starts with a header and points to index nodes; each index node starts with a header and points to leaf nodes; the leaf nodes hold the extents, which point to the disk blocks]

slide-21
SLIDE 21

21

Extent Related Work

  • Multiple block allocation: an efficient way to allocate a chunk of contiguous blocks at a time
  • Delayed allocation: enables multiple block allocation for buffered IO by deferring and clustering single block allocations

slide-22
SLIDE 22

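22

Buffered I/O Write Path

[Figure: write() grabs a page, calls prepare_write(), copies data from user, calls commit_write() and exits. pdflush calls writepages(), which locks the page, clusters pages and flushes them to disk. The figure also carries the labels "ext3 block allocation" and "delayed allocation", and shows page cache pages mapped to disk blocks 100, 200, 201, 202, 203]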

slide-23
SLIDE 23

23

Benefits of Delayed Allocation

  • May avoid the need for block allocation for temporary files
  • Improves the chances of allocating contiguous blocks on disk for a file
  • Reduces CPU cycles spent in repeated single block allocation calls, by clustering allocation for multiple blocks together.

slide-24
SLIDE 24

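24

Delayed Allocation

[Figure: write() grabs a page, calls prepare_write(), copies data from user, calls commit_write() and exits. pdflush calls writepages(), which locks the page, clusters pages and flushes them to disk. Block allocation is labelled on the pdflush side, and page cache pages are mapped to disk blocks 100, 200, 201, 202, 203]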

slide-25
SLIDE 25

25

Cost of Single Block Allocation

To add a single block to a file, ext3 has to:

  • Open a transaction
  • Load the inode's indirect blocks from disk to memory
  • Search the filesystem block bitmap to find a free block
  • Update the indirect mapping with the new block number
  • Add the modified blockmap blocks to the transaction
  • Add the modified inode to the transaction
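
The bitmap search step in the list above can be sketched as a linear scan from a goal block. This covers only that one step, under the assumption that a set bit means "in use" and that bits are numbered from the low bit of each byte; the function names are hypothetical and journalling is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Test bit `n` of a bitmap; a set bit means the block is in use. */
static int bit_is_set(const uint8_t *bitmap, size_t n)
{
    return (bitmap[n / 8] >> (n % 8)) & 1;
}

/*
 * Return the first free block at or after `goal`, wrapping around to the
 * start of the group if needed. Returns -1 if every block is in use.
 */
long find_free_block(const uint8_t *bitmap, size_t nblocks, size_t goal)
{
    for (size_t i = 0; i < nblocks; i++) {
        size_t n = (goal + i) % nblocks;
        if (!bit_is_set(bitmap, n))
            return (long)n;
    }
    return -1;
}
```

Repeating this scan, plus the transaction bookkeeping around it, once per block is the per-block cost that the allocation work described in this part of the talk tries to amortize.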
slide-26
SLIDE 26

26

Multiple Block Allocation

  • Increases the chance of getting contiguous blocks
  • Reduces CPU cycles spent in repeated single block allocation calls
  • Able to batch metadata updates and journal them once

[Figure: multiple block allocation maps a run of page cache pages to disk blocks 100, 200, 201, 202, 203]

slide-27
SLIDE 27

27

Buddy Based Extent Allocation

  • Collect per-block-group free extent info and store it in buddy data
  • Buddy data is an array of metadata, where each entry describes the status of a cluster of 2^n blocks
  • Combine buddy data and the traditional block bitmap to quickly find the length of a free extent

slide-28
SLIDE 28

28

[Figure: buddy info for orders 2^0, 2^1, 2^2 and 2^3 shown next to the disk block bitmap for blocks 0 to 7. The free extent at block 0 has order 2^2, the free extent at block 4 has order 2^1, and block 6 is allocated. Total free extent from block 0: 2^2 + 2^1 = 6 blocks]

Buddy Multiple Block Allocation Example
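
The arithmetic of this example can be reproduced with a small sketch. It assumes a bitmap in which blocks 0 to 5 are free and block 6 is allocated (one reading of the example), and it derives the buddy order at each position directly from the bitmap instead of reading stored buddy data, so it shows the idea rather than the patch's data layout. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if blocks [start, start + len) are all free (bit clear). */
static int range_is_free(const uint8_t *bitmap, size_t nblocks,
                         size_t start, size_t len)
{
    if (start + len > nblocks)
        return 0;
    for (size_t i = start; i < start + len; i++)
        if ((bitmap[i / 8] >> (i % 8)) & 1)
            return 0;
    return 1;
}

/*
 * Largest order k such that `block` is aligned to 2^k and the 2^k blocks
 * starting there are all free. Returns -1 if `block` itself is in use.
 */
int buddy_order_at(const uint8_t *bitmap, size_t nblocks, size_t block,
                   int max_order)
{
    int order = -1;

    for (int k = 0; k <= max_order; k++) {
        size_t len = (size_t)1 << k;
        if (block % len != 0 || !range_is_free(bitmap, nblocks, block, len))
            break;
        order = k;
    }
    return order;
}

/* Length of the free extent starting at `block`, walked chunk by chunk. */
size_t free_extent_len(const uint8_t *bitmap, size_t nblocks, size_t block,
                       int max_order)
{
    size_t total = 0;
    int k;

    while (block < nblocks &&
           (k = buddy_order_at(bitmap, nblocks, block, max_order)) >= 0) {
        total += (size_t)1 << k;
        block += (size_t)1 << k;
    }
    return total;
}
```

Starting at block 0 the walk takes a chunk of 2^2 blocks, then a chunk of 2^1 blocks at block 4, and stops at the allocated block 6, for a total of 6 blocks.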

slide-29
SLIDE 29

29

Evaluation of Extents Patches

  • Improvements for large file creation/removal/sequential read/sequential rewrite
  • Various benchmarks were run: dbench, tiobench, FFSB, filemark, sqlbench, iozone, etc.
  • Initial analysis indicates that:
    – The extents patch helps file sequential read/rewrite/removal
    – Multiple block allocation and delayed allocation help sequential write (file creation)

slide-30
SLIDE 30

30

Tiobench Sequential Write Comparison With Extents

[Figure: throughput (MB/sec) at 4, 16 and 64 threads for ext3 2.6.11, ext3+extents, JFS and XFS]

slide-31
SLIDE 31

31

Large File Sequential I/O Comparison Using FFSB

Throughput (MB/sec)   Sequential read   Sequential write   Sequential re-write
ext3                  127               71                 75.7
ext3+extents          153.7             91.9               102.7
JFS                   156.3             89.3               94.8
XFS                   166.3             104.3              100

slide-32
SLIDE 32

32

Part C: Improving Current Ext3

  • Delayed allocation without extents

(By Badari Pulavarty, Suparna Bhattacharya)

  • Allocating multiple blocks without extents

(By Mingming Cao)

  • Reduce file unlink and truncate latency

(By Andreas Dilger)

  • Increased number of subdirectories support

(By Andreas Dilger)

  • Parallel directory operations

(By Alex Tomas)

  • Finer timestamp

(By Alex Tomas, Andreas Gruenbacher)

slide-33
SLIDE 33

33

Delayed Allocation for Current Ext3

  • Concept: defer block allocation from prepare-write time to page flush-out time
  • Reserve filesystem free blocks at prepare-write time to avoid allocation failure later
  • Delayed allocation for different journalling modes:
    – data=writeback mode
    – data=ordered mode
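
The reserve-early, allocate-late idea can be sketched with two counters. This is only an illustration of the accounting, with hypothetical names; it says nothing about which blocks are eventually chosen or how the journalling modes interact with it.

```c
#include <stdint.h>

struct fs_space_sketch {
    uint64_t free_blocks;      /* blocks not yet allocated on disk */
    uint64_t reserved_blocks;  /* promised to dirty pages, not yet allocated */
};

/* Prepare-write time: promise `n` blocks, or fail early with -1. */
int reserve_blocks(struct fs_space_sketch *fs, uint64_t n)
{
    if (fs->free_blocks - fs->reserved_blocks < n)
        return -1;             /* would run out of space at flush time */
    fs->reserved_blocks += n;
    return 0;
}

/* Flush time: turn `n` previously reserved blocks into real allocations. */
void claim_blocks(struct fs_space_sketch *fs, uint64_t n)
{
    fs->reserved_blocks -= n;
    fs->free_blocks -= n;
}
```

Because the check happens at prepare-write time, the writer sees an out-of-space error while it can still react, and the later flush never has to fail for lack of blocks.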

slide-34
SLIDE 34

34

Delayed Allocation for Ordering Mode

  • Delayed allocation and bufferheads
    – A bufferhead is used to link a page to a disk block
    – A bufferhead is also used to link the page with the related journal for ordering purposes
    – Delayed allocation defers bufferhead creation for a page to writepages time, which is too late for ordering purposes
  • Proposed solution: add an ordered-like journalling mode which doesn't use bufferheads

slide-35
SLIDE 35

35

Allocating Multiple Blocks for Current Ext3

  • A simple, efficient way to allocate multiple blocks at a time
  • Based on the existing indirect block mapping; makes use of block reservation
  • Allocates the first block in the existing way, then allocates the rest on a best-effort basis

slide-36
SLIDE 36

36

[Figure: ext3_inode i_data with its indirect block mapping, and the inode's reservation window laid over the disk blocks]

  • File offset 13: goal block 300, request 50 blocks, RSV window (300, 307)
  • RSV window extended to (300, 349)
  • First block 300 allocated
  • Blocks 301-329 allocated (30 blocks total)
  • Indirect block is updated

Allocating Multiple Blocks Example
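
The numbers in this example can be reproduced with a small best-effort loop. The sketch assumes that block 330 is already in use, which would explain why only 30 of the 50 requested blocks are obtained; that detail, the bitmap bit order and every name here are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit n set means block n is in use. */
static int block_in_use(const uint8_t *bitmap, size_t n)
{
    return (bitmap[n / 8] >> (n % 8)) & 1;
}

static void mark_in_use(uint8_t *bitmap, size_t n)
{
    bitmap[n / 8] |= (uint8_t)(1u << (n % 8));
}

/*
 * Allocate up to `want` contiguous blocks starting at `goal`.
 * Stops at the first block that is in use or lies beyond the end of the
 * reservation window `rsv_end`. Marks the blocks as used and returns how
 * many were allocated (possibly 0 if `goal` itself is taken).
 */
size_t alloc_best_effort(uint8_t *bitmap, size_t goal, size_t rsv_end,
                         size_t want)
{
    size_t got = 0;

    while (got < want && goal + got <= rsv_end &&
           !block_in_use(bitmap, goal + got)) {
        mark_in_use(bitmap, goal + got);
        got++;
    }
    return got;
}
```

With goal block 300, a window ending at block 349 and block 330 in use, a request for 50 blocks returns 30 (blocks 300 to 329); the caller would then record those blocks in the indirect block in one go.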

slide-37
SLIDE 37

37

tiobench sequential write comparison with extents

[Figure: throughput (MB/sec) at 4, 16 and 64 threads for ext3 2.6.11, ext3+dm, ext3+extents, JFS and XFS]

slide-38
SLIDE 38

38

Large File Sequential Write Comparison Using FFSB

Filesystem      Sequential write throughput (MB/sec)
ext3            71
ext3+dm         90
ext3+extents    91.9
JFS             89.3
XFS             104.3

slide-41
SLIDE 41

41

Other ext3 work

  • Reduce file unlink and truncate latency
  • Increasing ext3 subdirectory limit
  • Parallel directory operations
  • Finer granularity time stamps for ext3
slide-42
SLIDE 42

42

Reduce File Unlink/Truncate Latency

  • Truncating a large indirect-mapped file is slow and synchronous. Root cause: there is a limit to the size of a single journal transaction.
  • Proposed solutions:
    – Background file unlink/truncate
    – Attempt to fit into one transaction if possible
    – Store i_disksize in the expanded inode.

slide-43
SLIDE 43

43

Increasing Ext3 Subdirectory Limit

  • Each subdirectory has a hard link to its parent
  • The number of subdirectories under a single directory is limited by the type of the inode's link count (16 bit)
  • Proposed solution to overcome this limit:
    – Stop counting subdirectories after the counter overflows and store a link count of 1 instead (every directory starts with a link count of 2)
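
A minimal sketch of the proposed counting rule, with hypothetical function names and the limit passed in as a parameter. Since a real directory never has a link count below 2, a stored value of 1 can safely mean "no longer tracked".

```c
#include <stdint.h>

/*
 * Update a directory's link count when a subdirectory is added.
 * `limit` is the largest count the filesystem is willing to store.
 */
uint16_t dir_inc_nlink(uint16_t nlink, uint16_t limit)
{
    if (nlink == 1)          /* already overflowed: stay untracked */
        return 1;
    if (nlink >= limit)      /* would exceed the limit: stop counting */
        return 1;
    return (uint16_t)(nlink + 1);
}

/* Matching update when a subdirectory is removed. */
uint16_t dir_dec_nlink(uint16_t nlink)
{
    if (nlink <= 2)          /* untracked, or no subdirectories to subtract */
        return nlink;
    return (uint16_t)(nlink - 1);
}
```

For example, with a limit of 32000 a directory at 32000 links that gains another subdirectory drops to 1 and stays there, and later removals leave the value at 1 because the true count is no longer known.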

slide-44
SLIDE 44

44

Parallel Directory Operations

  • Concurrent file operations in a single directory are serialized by a per-directory semaphore
  • Allow concurrent create/unlink/rename of files within a single directory
    – VFS: lock individual hash entries in a directory
    – Ext3: add a per-directory-leaf-block lock

slide-45
SLIDE 45

45

Finer granularity time stamps for ext3

  • The regular ext3 on-disk inode doesn't have enough space to store high-precision time stamps
  • Proposed solution: store nanosecond time stamps in the expanded inode
  • Concerns:
    – Additional dirtying and writeout for atime/mtime/ctime updates
    – The expanded inode may be filled up by extended attributes

slide-46
SLIDE 46

46

Future work

  • Continue to improve the current ext3 filesystem
    – Improve features already accepted in mainline
    – Finish work in progress for the current ext3 filesystem
  • Moving ext3 forward: extents, and other potential future work
    – Increase the 8/16TB filesystem limit (64-bit block numbers)
    – Extensible inode table (avoids preallocating too many inodes)

slide-47
SLIDE 47

47

Questions?

Contact the authors:

Mingming Cao cmm@us.ibm.com
Theodore Y. Ts'o tytso@us.ibm.com
Badari Pulavarty pbadari@us.ibm.com
Suparna Bhattacharya suparna@in.ibm.com
Andreas Dilger adilger@clusterfs.com
Alex Tomas alex@clusterfs.com

Sources of the paper and this presentation are at ext2.sourceforge.net

slide-48
SLIDE 48

48

Legal Statement

This work represents the view of the authors and does not necessarily represent the view of IBM. IBM and the IBM logo are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries. Lustre is a trademark of Cluster File Systems, Inc. Unix is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others. References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates. This document is provided "AS IS," with no express or implied warranties. Use the information in this document at your own risk.