
SLIDE 1

1

Ottawa Linux Symposium 2005

State of the Art: Where we are with the ext3 filesystem

Mingming Cao, Theodore Y. Ts'o, Badari Pulavarty, Suparna Bhattacharya
IBM Linux Technology Center

Andreas Dilger, Alex Tomas
Cluster File Systems, Inc.

SLIDE 2

2

Introduction

Why does the ext3 filesystem have a large community and many users?
Why is improving the current ext3 filesystem important?
The plan to “modernize” the ext3 filesystem.
What happened to ext3 in the past three years?

SLIDE 3

3

Activities Summary

Included features              Linux-2.4      Linux-2.6.11
Directory Indexing             patch/vendor   yes
Online Resizing                patch          yes
Reduce Locking Contention      no             yes
Block Reservation              no             yes
Extended Attributes            patch/vendor   yes

Under development              Linux-2.4      Linux-2.6
Extents                        patch          patch
Delayed Allocation             no             patch
Multiple Block Allocation      no             patch
Reduce File Removal Latency    patch          easy to port
Increased subdir support       patch          patch
Parallel directory operations  patch          patch
Finer Timestamp                no             patch

SLIDE 4

4

Overview

Part A: New ext3 features added to Linux 2.6
Some ext3 features under development:

– Part B: Features that imply a filesystem format change
– Part C: Improvements within the current ext3 layout

Future work

SLIDE 5

5

Part A: Features Added to Linux 2.6

Directory indexing

(By Daniel Phillips, Theodore Y. Ts'o, 2.5)

Online resizing

(By Andreas Dilger, Stephen Tweedie, 2.6.10)

Removing the BKL and improving scalability

(By Andrew Morton, Alex Tomas, 2.5)

Extended attributes

(By Andreas Gruenbacher, 2.5)

Reservation based block preallocation

(By Mingming Cao, Andrew Morton, Stephen Tweedie, Badari Pulavarty, 2.6.10)

SLIDE 6

6

Directory Indexing

Scalability issue with old ext2/3 directories: entries are kept in a simple linked list.
A simplified tree structure (HTree) was designed for directories.
HTree features: 32-bit hashes for keys, high fanout factor, constant depth
Boosts ext3 performance on large directories.
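
As a rough illustration of why hashed keys with a high fanout keep lookups cheap, the sketch below does a binary search over one index block whose entries are sorted by 32-bit hash. The names `dx_entry_sketch` and `dx_find_block` are hypothetical, and the layout is an assumption made for the example, not the on-disk HTree format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index entry: names whose hash is >= `hash` (and below the
 * next entry's hash) are stored in directory block `block`. */
struct dx_entry_sketch {
    uint32_t hash;
    uint32_t block;
};

/* Binary search over entries sorted by hash: return the block of the last
 * entry whose hash is <= target. Assumes count >= 1 and that
 * entries[0].hash is 0, so every possible target is covered. */
uint32_t dx_find_block(const struct dx_entry_sketch *entries,
                       size_t count, uint32_t target)
{
    size_t lo = 0, hi = count;      /* invariant: the answer is in [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (entries[mid].hash <= target)
            lo = mid;
        else
            hi = mid;
    }
    return entries[lo].block;
}
```

With a constant tree depth, a lookup is a fixed number of such searches followed by a scan of a single leaf block, instead of a walk over the whole directory.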

SLIDE 7

7

Online Resizing

Online resizing allows a filesystem to grow without downtime. A very useful feature in server environments.
Handles the three primary cases in which a filesystem can grow:
– Grow within the last block group
– Need a new block group
– Need a new block group descriptor
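
The three cases can be told apart with simple arithmetic on the old and new sizes. The sketch below is a simplification under stated assumptions: it ignores the first data block offset, and `classify_grow`, `blocks_per_group` and `desc_per_block` are hypothetical names chosen for the example rather than the resize code's own.

```c
enum grow_case {
    GROW_LAST_GROUP,      /* new blocks fit in the last block group */
    GROW_NEW_GROUP,       /* at least one new block group is needed */
    GROW_NEW_DESC_BLOCK   /* the new groups also need more descriptor space */
};

/* Integer division rounding up. */
unsigned long div_round_up(unsigned long a, unsigned long b)
{
    return (a + b - 1) / b;
}

/* Classify a grow from old_blocks to new_blocks (new_blocks > old_blocks). */
enum grow_case classify_grow(unsigned long old_blocks,
                             unsigned long new_blocks,
                             unsigned long blocks_per_group,
                             unsigned long desc_per_block)
{
    unsigned long old_groups = div_round_up(old_blocks, blocks_per_group);
    unsigned long new_groups = div_round_up(new_blocks, blocks_per_group);

    if (new_groups == old_groups)
        return GROW_LAST_GROUP;
    if (div_round_up(new_groups, desc_per_block) ==
        div_round_up(old_groups, desc_per_block))
        return GROW_NEW_GROUP;
    return GROW_NEW_DESC_BLOCK;
}
```

For instance, with 32768 blocks per group and 128 descriptors per block (illustrative values), growing from 100000 to 120000 blocks stays inside the fourth group, while growing past 128 groups needs additional descriptor space.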

SLIDE 8

8

Reduce Lock Contention

Scaling issue for 2.4 ext3/JBD with concurrent I/O
A series of efforts were made:
– Ext3: replaced the per-filesystem superblock lock with finer-grained locks
– Journalling layer: pushed the BKL out of JBD
SDET benchmark throughput improved by a factor of 10

SLIDE 9

9

Extended Attributes

Extended attributes (EAs): small amounts of custom metadata associated with files or directories
Added to ext2/3 to support ACLs
EAs are stored in a single EA block, which can be shared by inodes that have the same extended attributes
Can be stored in the expanded inode itself (2.6.11+)
EA-in-inode noticeably speeds up ext3 performance on the Samba4 benchmark

SLIDE 10

10

Reservation based block preallocation

Block preallocation helps reduce file fragmentation
Ext3 uses in-memory block reservation to support a large preallocation
Results in significant throughput improvements on concurrent sequential writes and the subsequent sequential reads

[Figure: on-disk block layout of file 1, file 2, file 3 and file 4; panels: Ext3 (before), Ext3 (after)]

SLIDE 11

11

Ext3 Block Reservation

Key difference: the reservation is kept in memory rather than on disk
Each inode has its own reservation window; windows cannot overlap and are indexed by a per-filesystem red-black tree
Allocation happens within the window. The window can dynamically move and grow
The window is discarded at the last file close
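
To make the non-overlap rule concrete, here is a minimal sketch of placing a new reservation window among existing ones. It swaps the red-black tree for a sorted array purely to stay short, and `rsv_window_sketch` and `rsv_find_gap` are hypothetical names, not the kernel's.

```c
#include <stddef.h>

/* Hypothetical reservation window: an inclusive range of block numbers. */
struct rsv_window_sketch {
    unsigned long start;
    unsigned long end;
};

/* Given existing windows sorted by start and non-overlapping, find the
 * lowest start >= goal where a window of `size` blocks fits without
 * overlapping any of them. Returns 1 and stores the start in *out on
 * success, or 0 if nothing fits at or below last_block. */
int rsv_find_gap(const struct rsv_window_sketch *w, size_t n,
                 unsigned long goal, unsigned long size,
                 unsigned long last_block, unsigned long *out)
{
    if (size == 0)
        return 0;
    unsigned long start = goal;
    for (size_t i = 0; i < n; i++) {
        if (w[i].end < start)
            continue;               /* window lies entirely before the cursor */
        if (start + size - 1 < w[i].start)
            break;                  /* the gap before this window is big enough */
        start = w[i].end + 1;       /* skip past the conflicting window */
    }
    if (start + size - 1 > last_block)
        return 0;
    *out = start;
    return 1;
}
```

A balanced tree gives the same answer while keeping insertion and removal of windows logarithmic, which matters when many files are being written at once.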

SLIDE 12

12

Reservation Tree

[Figure: per-filesystem tree of reservation windows (8, 31), (0, 7), (32, 63) and (64, 71), each belonging to one of files 1 to 4 and covering a range of disk blocks]

SLIDE 13

13

tiobench sequential write

[Figure: bar chart of throughput (MB/sec) at 4, 16 and 64 threads for ext3 2.4.29, ext3 2.6.11, JFS and XFS]

SLIDE 14

14

Part B: Extents and Related Work

Extents
Extents Allocation
Delayed allocation for extents

(By Alex Tomas)

SLIDE 15

15

Ext2/3 Indirect Block Map

[Figure: i_data in the inode points to direct blocks and to an indirect block, a double indirect block and a triple indirect block, which lead to the disk blocks]

SLIDE 16

16

Extents

An extent is an efficient way to represent a large contiguous file
An extent is a single descriptor for a range of contiguous blocks

Field      logical   length   physical
Width      32 bit    16 bit   48 bit

Example from the figure: length 1000, physical start block 200

SLIDE 17

17

Extents Data Structures

struct ext3_extent {
        __u32 ee_block;      /* logical */
        __u16 ee_len;        /* length */
        __u16 ee_start_hi;
        __u32 ee_start;      /* physical */
};

struct ext3_extent_header {
        __u16 eh_magic;
        __u16 eh_entries;
        __u16 eh_max;
        __u16 eh_depth;
        __u32 eh_generation;
};
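
A small sketch of how one such extent could translate a logical block to a physical block. It mirrors the field names above but uses `<stdint.h>` types so it compiles outside the kernel; treating `ee_start_hi` as the upper 16 bits of the 48-bit physical block number is an assumption based on the field widths, and `extent_pblock` and `extent_map` are hypothetical helpers.

```c
#include <stdint.h>

/* Same field layout as the slide's struct ext3_extent. */
struct extent_sketch {
    uint32_t ee_block;     /* first logical block covered */
    uint16_t ee_len;       /* number of blocks covered */
    uint16_t ee_start_hi;  /* assumed: high 16 bits of the physical block */
    uint32_t ee_start;     /* low 32 bits of the physical block */
};

/* Combine the two start fields into one 48-bit physical block number. */
uint64_t extent_pblock(const struct extent_sketch *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start;
}

/* Map a logical block through one extent. Returns 1 and stores the
 * physical block in *pblk if the extent covers lblk, otherwise 0. */
int extent_map(const struct extent_sketch *ex, uint32_t lblk, uint64_t *pblk)
{
    if (lblk < ex->ee_block || lblk - ex->ee_block >= ex->ee_len)
        return 0;
    *pblk = extent_pblock(ex) + (lblk - ex->ee_block);
    return 1;
}
```

For an extent of length 1000 at physical block 200 (taking logical start 0 as an assumption), logical block 999 maps to physical block 1199, and one 12-byte descriptor replaces 1000 individual block pointers.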

SLIDE 18

18

Extent Map

[Figure: i_data holding an extent header followed by extents (entries shown: 1000, 200 and 1001, 2000, 6000), which point to the disk block ranges 200 to 1199 and 6000 to 6199]

SLIDE 19

19

Extents Tree

A simplified tree, like the HTree used in directories
Constant depth
Tree nodes include index nodes and leaf nodes
Each node starts with a header structure
A flag in the inode indicates the block addressing type: extent map or indirect block mapping
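
Within a leaf node, finding the extent that covers a logical block can be a binary search, assuming the extents are kept sorted by logical start. The sketch below uses a simplified in-memory record and the hypothetical names `leaf_extent_sketch` and `leaf_find_extent`; it is not the patch's own lookup routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified in-memory view of one extent in a leaf node. */
struct leaf_extent_sketch {
    uint32_t lblock;   /* first logical block */
    uint16_t len;      /* number of contiguous blocks */
    uint64_t pblock;   /* first physical block */
};

/* Search `entries` extents sorted by lblock for the one covering lblk.
 * Returns a pointer to it, or NULL when lblk falls in a hole. */
const struct leaf_extent_sketch *
leaf_find_extent(const struct leaf_extent_sketch *ext, size_t entries,
                 uint32_t lblk)
{
    size_t lo = 0, hi = entries;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (lblk < ext[mid].lblock)
            hi = mid;                       /* target is to the left */
        else if (lblk - ext[mid].lblock >= ext[mid].len)
            lo = mid + 1;                   /* target is past this extent */
        else
            return &ext[mid];               /* extent covers lblk */
    }
    return NULL;
}
```

An index node could be searched the same way, with each entry pointing to a child node instead of a run of data blocks.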

SLIDE 20

20

Extent Tree

[Figure: i_data holds the root (a header plus index entries); index nodes point to leaf nodes; each node starts with a header, and the extents in the leaf nodes point to disk blocks]

SLIDE 21

21

Extent Related Work

Multiple block allocation
– An efficient way to allocate a chunk of contiguous blocks at a time
Delayed allocation
– Enables multiple block allocation for buffered I/O by deferring and clustering single block allocations

SLIDE 22

22

Buffered I/O Write Path

[Figure: write() path: grab a page, prepare_write(), copy data from user, commit_write(), exit. pdflush path: writepages(), lock page, cluster pages, flush to disk. Labels: ext3 block allocation, delayed allocation. Page cache pages map to disk blocks 100, 200, 201, 202, 203]

SLIDE 23

23

Benefits of Delayed Allocation

May avoid the need for block allocation for temporary files
Improves the chances of allocating contiguous blocks on disk for a file
Reduces CPU cycles spent in repeated single block allocation calls by clustering the allocation of multiple blocks together

SLIDE 24

24

Delayed Allocation

[Figure: write() path: grab a page, prepare_write(), copy data from user, commit_write(), exit. pdflush path: writepages(), lock page, cluster pages, block allocation, flush to disk. Page cache pages map to disk blocks 100, 200, 201, 202, 203]

SLIDE 25

25

Cost of Single Block Allocation

To add a single block to a file, ext3 has to:
– Open a transaction
– Load the inode's indirect blocks from disk into memory
– Search the filesystem block bitmap to find a free block
– Update the indirect mapping with the new block number
– Add the modified blockmap blocks to the transaction
– Add the modified inode to the transaction

SLIDE 26

26

Multiple Block Allocation

Increases the chance of getting contiguous blocks
Reduces CPU cycles spent in repeated single block allocation calls
Able to batch the metadata updates and journal them once

[Figure: multiple block allocation mapping page cache pages to disk blocks 100, 200, 201, 202, 203]
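
The core of the idea is to claim a whole run of free blocks in one pass over the block bitmap, rather than calling the allocator once per block. The sketch below assumes a set bit means "allocated"; `bit_is_set` and `alloc_run` are hypothetical names, and journaling is left out entirely.

```c
#include <stddef.h>

/* Test bit i of a block bitmap; a set bit means the block is allocated. */
int bit_is_set(const unsigned char *bitmap, size_t i)
{
    return (bitmap[i / 8] >> (i % 8)) & 1;
}

/* Starting the search at `goal`, claim up to `want` contiguous free blocks.
 * Returns the number of blocks claimed and stores the first one in *first;
 * returns 0 if there is no free block at or after goal. */
size_t alloc_run(unsigned char *bitmap, size_t nbits, size_t goal,
                 size_t want, size_t *first)
{
    size_t start = goal;
    while (start < nbits && bit_is_set(bitmap, start))
        start++;                            /* find the first free block */
    if (start >= nbits || want == 0)
        return 0;

    size_t got = 0;
    while (got < want && start + got < nbits &&
           !bit_is_set(bitmap, start + got)) {
        size_t b = start + got;
        bitmap[b / 8] |= (unsigned char)(1u << (b % 8));   /* mark allocated */
        got++;
    }
    *first = start;
    return got;
}
```

Because the run is found and marked in one call, the caller can update the block map and add the changed metadata to the journal once for the whole run.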

SLIDE 27

27

Buddy Based Extent Allocation

Collects per-block-group free extent information and stores it in buddy data
Buddy data is an array of metadata, where each entry describes the status of a cluster of 2^n blocks
Combines buddy data and the traditional block bitmap to quickly search for a free extent of a given length
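
A toy version of this scheme for a group of eight blocks is sketched below. Order n of the buddy data records which aligned clusters of 2^n blocks are entirely free, and the free extent length from a given block is found by repeatedly taking the largest free cluster that starts at the cursor. All names (`buddy_build`, `buddy_free_len`, `BLOCKS`, `MAX_ORDER`) are hypothetical and the in-memory layout is an assumption.

```c
#include <stddef.h>

#define BLOCKS     8   /* blocks in this toy block group */
#define MAX_ORDER  3   /* largest cluster is 2^3 = 8 blocks */

/* buddy[n][i] is 1 when the aligned cluster of 2^n blocks starting at
 * block i * 2^n is entirely free. used[b] is 1 when block b is allocated. */
void buddy_build(const unsigned char used[BLOCKS],
                 unsigned char buddy[MAX_ORDER + 1][BLOCKS])
{
    for (size_t i = 0; i < BLOCKS; i++)
        buddy[0][i] = !used[i];
    for (size_t n = 1; n <= MAX_ORDER; n++)
        for (size_t i = 0; i < (BLOCKS >> n); i++)
            buddy[n][i] = buddy[n - 1][2 * i] && buddy[n - 1][2 * i + 1];
}

/* Length of the free extent starting at `start`. */
size_t buddy_free_len(unsigned char buddy[MAX_ORDER + 1][BLOCKS], size_t start)
{
    size_t pos = start;
    while (pos < BLOCKS) {
        int n = MAX_ORDER;
        /* largest order whose cluster is aligned at pos and fully free */
        while (n >= 0 && !((pos & (((size_t)1 << n) - 1)) == 0 &&
                           buddy[n][pos >> n]))
            n--;
        if (n < 0)
            break;                  /* the block at pos is allocated */
        pos += (size_t)1 << n;
    }
    return pos - start;
}
```

With blocks 0 to 5 free and block 6 allocated, the search from block 0 takes a 2^2 cluster and then a 2^1 cluster, giving an extent of 6 blocks in two steps instead of six bitmap tests.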

SLIDE 28

28

Buddy Multiple Block Allocation Example

[Figure: disk block bitmap for blocks 0 to 7 alongside the buddy info for orders 2^0, 2^1, 2^2 and 2^3]

Free extents found from the buddy info:
– Block 0: free extent of 2^2 blocks
– Block 4: free extent of 2^1 blocks
– Block 6: allocated
Total free extent from block 0: 2^2 + 2^1 = 6 blocks

SLIDE 29

29

Evaluation of Extents Patches

Improvements for large file creation/removal/sequential read/sequential rewrite
Various benchmarks were run: dbench, tiobench, FFSB, filemark, sqlbench, iozone, etc.
Initial analysis indicates that:
– The extents patch helps file sequential read/rewrite/removal
– Multiple block allocation and delayed allocation help sequential write (file creation)

SLIDE 30

30

Tiobench Sequential Write Comparison With Extents

[Figure: bar chart of throughput (MB/sec) at 4, 16 and 64 threads for ext3 2.6.11, ext3+extents, JFS and XFS]

SLIDE 31

31

Large File Sequential I/O Comparison Using FFSB

Throughput (MB/sec)   Sequential read   Sequential write   Sequential re-write
ext3                  127               71                 75.7
ext3+extents          153.7             91.9               102.7
JFS                   156.3             89.3               94.8
XFS                   166.3             104.3              100

SLIDE 32

32

Part C: Improving Current Ext3

Delayed allocation without extents

(By Badari Pulavarty, Suparna Bhattacharya)

Allocating multiple blocks without extents

(By Mingming Cao)

Reduce file unlink and truncate latency

(By Andreas Dilger)

Increased number of subdirectories support

(By Andreas Dilger)

Parallel directory operations

(By Alex Tomas)

Finer timestamp

(By Alex Tomas, Andreas Gruenbacher)

SLIDE 33

33

Delayed Allocation for Current Ext3

Concept: defer block allocation from prepare-write time to page flush-out time
Reserve filesystem free blocks at prepare-write time to avoid allocation failure later
Delayed allocation for different journalling modes:
– data=writeback mode
– data=ordered mode
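
The reservation step can be pictured as simple bookkeeping: promise blocks at prepare-write time without choosing them, then turn the promise into a real allocation at flush time. The sketch below uses hypothetical names (`fs_space_sketch`, `da_reserve`, `da_allocate_reserved`, `da_release`) and leaves out locking and journaling.

```c
/* Hypothetical per-filesystem counters. */
struct fs_space_sketch {
    unsigned long free_blocks;      /* blocks not yet allocated on disk */
    unsigned long reserved_blocks;  /* promised to dirty pages, not yet allocated */
};

/* Prepare-write time: promise n blocks without picking them.
 * Returns 0 on success, -1 if the promise could not be honoured later. */
int da_reserve(struct fs_space_sketch *fs, unsigned long n)
{
    if (fs->free_blocks - fs->reserved_blocks < n)
        return -1;
    fs->reserved_blocks += n;
    return 0;
}

/* Flush time: n previously reserved blocks are now really allocated. */
void da_allocate_reserved(struct fs_space_sketch *fs, unsigned long n)
{
    fs->reserved_blocks -= n;
    fs->free_blocks -= n;
}

/* The data never reached disk (for example a deleted temporary file):
 * give the promise back without allocating anything. */
void da_release(struct fs_space_sketch *fs, unsigned long n)
{
    fs->reserved_blocks -= n;
}
```

Failing early in `da_reserve` is what lets write() report a full filesystem to the caller, since an error discovered at flush time would have nowhere to go.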

SLIDE 34

34

Delayed Allocation for Ordered Mode

Delayed allocation and bufferheads:
– A bufferhead is used to link a page to a disk block
– A bufferhead is also used to link the page with the related journal for ordering purposes
– Delayed allocation defers bufferhead creation for a page to writepages time, which is too late for ordering purposes
Proposed solution: add an ordered-like journalling mode which doesn't use bufferheads

SLIDE 35

35

Allocating Multiple Blocks for Current Ext3

A simple, efficient way to allocate multiple blocks at a time
Based on the existing indirect block mapping; makes use of block reservation
Allocates the first block in the existing way, then allocates the rest on a best-effort basis

SLIDE 36

36

Allocating Multiple Blocks Example

[Figure: ext3_inode i_data with its direct blocks and an indirect block, the inode's reservation window, and the disk blocks being allocated]

File offset 13: goal block 300, request 50 blocks
– RSV window (300, 307)
– RSV window extended to (300, 349)
– First block 300 allocated
– Blocks 301-329 allocated (30 blocks total)
– Indirect block is being updated
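
The best-effort step can be sketched as follows: take the goal block first, then keep extending only while the next block is free and still inside the reservation window. `alloc_best_effort` is a hypothetical name, the `used` array stands in for the block bitmap, and the test values assume block 330 is already in use, which would explain why only 30 of the 50 requested blocks are obtained in the example.

```c
#include <stddef.h>

/* Try to allocate up to `want` contiguous blocks starting at `goal`,
 * staying inside the inclusive reservation window [win_start, win_end].
 * used[b] is nonzero when block b is already allocated.
 * Returns the number of blocks allocated (0 if the goal block is taken). */
size_t alloc_best_effort(unsigned char *used, size_t goal, size_t want,
                         size_t win_start, size_t win_end)
{
    if (want == 0 || goal < win_start || goal > win_end || used[goal])
        return 0;

    used[goal] = 1;     /* first block: stands in for the existing allocator */
    size_t got = 1;

    /* remaining blocks: continue only while the next block is free */
    while (got < want && goal + got <= win_end && !used[goal + got]) {
        used[goal + got] = 1;
        got++;
    }
    return got;
}
```

The caller gets back however many blocks were contiguous, updates the indirect block once for all of them, and simply asks again for the remainder.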

SLIDE 37

37

tiobench sequential write comparison with extents

[Figure: bar chart of throughput (MB/sec) at 4, 16 and 64 threads for ext3 2.6.11, ext3+dm, ext3+extents, JFS and XFS]

SLIDE 38

38

Large File Sequential Write Comparison Using FFSB

Sequential write throughput (MB/sec)
ext3            71
ext3+dm         90
ext3+extents    91.9
JFS             89.3
XFS             104.3

SLIDE 39

39

SLIDE 40

40

SLIDE 41

41

Other ext3 work

Reduce file unlink and truncate latency
Increasing ext3 subdirectory limit
Parallel directory operations
Finer granularity time stamps for ext3

SLIDE 42

42

Reduce File Unlink/Truncate Latency

Truncating a large indirect-mapped file is slow and synchronous. Root cause: there is a limit to the size of a single journal transaction.
Proposed solutions:
– Background file unlink/truncate
– Attempt to fit into one transaction if possible
– Store i_disksize in the expanded inode

SLIDE 43

43

Increasing Ext3 Subdirectory Limit

Each subdirectory has a hard link to its parent
The number of subdirectories under a single directory is limited by the type of the inode's link count (16 bit)
Proposed solution to overcome this limit:
– Stop counting subdirectories after the counter overflows and store a link count of 1 instead (every directory starts with a link count of 2)
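
A minimal sketch of that counting rule, assuming a caller-supplied `limit` for the largest link count that is still tracked. The function names `dir_link_inc` and `dir_link_dec` are hypothetical.

```c
#include <stdint.h>

/* A subdirectory is added. Once the count would pass `limit`, store 1,
 * which means "no longer counted". The value 1 is unambiguous because a
 * counted directory never has fewer than 2 links. */
uint16_t dir_link_inc(uint16_t nlink, uint16_t limit)
{
    if (nlink == 1)
        return 1;                   /* already in the uncounted state */
    if (nlink >= limit)
        return 1;                   /* would overflow: stop counting */
    return (uint16_t)(nlink + 1);
}

/* A subdirectory is removed. */
uint16_t dir_link_dec(uint16_t nlink)
{
    if (nlink == 1)
        return 1;                   /* real count unknown: stay uncounted */
    return (uint16_t)(nlink - 1);
}
```

Once a directory enters the uncounted state it stays there in this sketch, since the true number of subdirectories can no longer be recovered from the stored value.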

SLIDE 44

44

Parallel Directory Operations

Concurrent file operations in a single directory are serialized by a per-directory semaphore
Allow concurrent create/unlink/rename of files within a single directory:
– VFS: lock individual hash entries in a directory
– Ext3: add a per-directory-leaf-block lock

SLIDE 45

45

Finer granularity time stamps for ext3

The regular ext3 on-disk inode doesn't have enough space to store a high-precision time stamp
Proposed solution: store the nanosecond time stamp in the expanded inode
Concerns:
– Additional dirtying and writeout for atime/mtime/ctime updates
– The expanded inode may be filled up by extended attributes

SLIDE 46

46

Future work

Continue to improve the current ext3 filesystem
– Improve features already accepted in mainline
– Finish work in progress for the current ext3 filesystem
Moving ext3 forward: extents and other potential future work
– Increase the 8/16TB filesystem limit (64-bit block numbers)
– Extensible inode table (avoid preallocating too many inodes)

SLIDE 47

47

Questions?

Contact the authors:
Mingming Cao cmm@us.ibm.com
Theodore Y. Ts'o tytso@us.ibm.com
Badari Pulavarty pbadari@us.ibm.com
Suparna Bhattacharya suparna@in.ibm.com
Andreas Dilger adilger@clusterfs.com
Alex Tomas alex@clusterfs.com

Sources of the paper and this presentation are at ext2.sourceforge.net

SLIDE 48

48

Legal Statement

This work represents the view of the authors and does not necessarily represent the view of IBM. IBM and the IBM logo are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries. Lustre is a trademark of Cluster File Systems, Inc. Unix is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others. References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates. This document is provided "AS IS," with no express or implied warranties. Use the information in this document at your own risk.
