SLIDE 1

The Design and Implementation of the Warp Transactional Filesystem

Robert Escriva, Emin Gün Sirer

Cornell University

Symposium on Networked Systems Design and Implementation, March 18, 2016

The Design and Implementation of WTF 1 / 28

SLIDE 2

Common Trends in Distributed Filesystems

Compromises or limitations are often introduced in search of higher performance:

✪ Weak guarantees:
    Eventual consistency
    “Consistent, but undefined”

✪ Narrow interfaces:
    Writes must be sequential
    Concurrent writes prohibited

✪ Unscalable design:
    Full-bisection bandwidth
    Large “master” server

SLIDE 3

Warp Transactional Filesystem (WTF)

WTF represents a new design point in the space of distributed filesystems

WTF employs the file slicing abstraction to provide applications with strong guarantees and zero-copy filesystem interfaces

✦ Strong guarantees: transactionally access and modify the filesystem
✦ Expanded interface: traditional POSIX APIs and new zero-copy APIs
✦ Scalable design: avoids centralized master or expensive network bottlenecks

SLIDE 4

Zero-Copy File Slicing APIs

Traditional APIs transfer bytes back and forth through the filesystem interface

File-slicing APIs deal in references to data already in the filesystem

yank: Obtain references to data in the filesystem. Analogous to read
paste: Write referenced data back to the filesystem. Analogous to write
append: Append referenced data to the end of a file. Optimized for concurrency
concat: Merge one or more files to create a new file. Does not read or write data from the input files
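The four file-slicing calls can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyFS`, `Ref`, and every method signature below are hypothetical names invented for illustration, not WTF's client library. The point it shows is that yank, paste, append, and concat rearrange references and never copy stored bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical reference to bytes [start, end) of an immutable stored slice."""
    slice_id: int
    start: int
    end: int


class ToyFS:
    """Toy model: a file is an ordered list of Refs; slices are never modified."""

    def __init__(self):
        self.slices = {}        # slice_id -> bytes (stand-in for storage servers)
        self.files = {}         # path -> list[Ref] (stand-in for metadata)
        self.bytes_written = 0  # counts bytes that actually crossed the interface

    def write(self, path, data):
        """Traditional append-style write: the only call that moves bytes in."""
        sid = len(self.slices)
        self.slices[sid] = bytes(data)
        self.bytes_written += len(data)
        self.files.setdefault(path, []).append(Ref(sid, 0, len(data)))

    def read(self, path):
        refs = self.files.get(path, [])
        return b"".join(self.slices[r.slice_id][r.start:r.end] for r in refs)

    def size(self, path):
        return sum(r.end - r.start for r in self.files.get(path, []))

    def yank(self, path, offset, length):
        """Return references covering file bytes [offset, offset + length)."""
        out, pos = [], 0
        for r in self.files.get(path, []):
            n = r.end - r.start
            lo, hi = max(offset, pos), min(offset + length, pos + n)
            if lo < hi:
                out.append(Ref(r.slice_id, r.start + lo - pos, r.start + hi - pos))
            pos += n
        return out

    def paste(self, path, offset, refs):
        """Overwrite the file at offset with referenced data (no holes in this toy)."""
        size = self.size(path)
        if offset > size:
            raise ValueError("this sketch does not model holes")
        n = sum(r.end - r.start for r in refs)
        head = self.yank(path, 0, offset)
        tail = self.yank(path, offset + n, max(0, size - offset - n))
        self.files[path] = head + list(refs) + tail

    def append(self, path, refs):
        """Append referenced data to the end of the file."""
        self.files.setdefault(path, []).extend(refs)

    def concat(self, sources, dest):
        """Create dest from the sources' references; input data is not touched."""
        self.files[dest] = [r for s in sources for r in self.files[s]]


# Swap the first and last "scenes" of a file without rewriting any bytes.
fs = ToyFS()
fs.write("/movie", b"AAAABBBBCCCC")
first, last = fs.yank("/movie", 0, 4), fs.yank("/movie", 8, 4)
fs.paste("/movie", 0, last)
fs.paste("/movie", 8, first)
print(fs.read("/movie"), fs.bytes_written)
```

In this toy the byte counter stays at 12 after the swap, because only the reference lists changed.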

SLIDE 5

The File Slicing Abstraction

The central abstraction is a slice: an immutable, byte-addressable, arbitrarily sized sequence of bytes

A file is represented by a sequence of slices that, when overlaid, comprise the file’s contents

[Figure: overlaid slices and the resulting file contents]
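A minimal sketch of the overlay rule, assuming each slice is recorded as a hypothetical `(file_offset, data)` pair in write order and that unwritten gaps read back as zero bytes (both are assumptions made for illustration):

```python
def overlay(slices):
    """Materialize file contents from (file_offset, data) pairs, oldest first."""
    size = max((off + len(data) for off, data in slices), default=0)
    contents = bytearray(size)        # assumption: gaps read back as zero bytes
    for off, data in slices:          # later slices cover earlier ones
        contents[off:off + len(data)] = data
    return bytes(contents)


# Two adjacent slices, then a third that overlaps part of both.
contents = overlay([(0, b"AA"), (2, b"BB"), (1, b"CC")])
print(contents)
```

Here the third slice covers the second half of the first slice and the first half of the second, so the file reads `ACCB`.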

SLIDE 6

WTF Architecture

[Figure: architecture diagram with four components: End User Application, Client Library, Metadata Storage, Storage Servers]

SLIDE 7

WTF Architecture

[Figure: architecture diagram]

The metadata storage provides transactional operations over the metadata

SLIDE 8

WTF Architecture

[Figure: architecture diagram]

The client library extends these transactional guarantees to the end user

SLIDE 9

Slices and Slice Pointers

[Figure: storage servers s0 and s1, chunks c1 through c4, and slices A and B]

Slices reside on storage servers, while pointers to slices reside in HyperDex

SLIDE 10

Slices and Slice Pointers

[Figure: storage servers s0 and s1, chunks c1 through c4, and slices A and B]

Slice Pointer A:
    server: s0
    chunk: c1
    start: 1,073,816,936
    end: 8,589,788,476

Slice Pointer B:
    server: s1
    chunk: c4
    start: 10,737,389,932
    end: 13,958,442,063

Slice pointers directly indicate a slice’s location in the system
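One way to picture a slice pointer is as a small immutable record. The sketch below assumes the four fields shown on the slide and treats `end` as exclusive; the class name, the `length` property, and the `sub` helper are hypothetical and not taken from WTF's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlicePointer:
    server: str
    chunk: str
    start: int  # byte offset within the chunk
    end: int    # assumption: exclusive end offset

    @property
    def length(self) -> int:
        return self.end - self.start

    def sub(self, offset: int, length: int) -> "SlicePointer":
        """Derive a pointer to a sub-range using only local arithmetic."""
        if offset < 0 or length < 0 or offset + length > self.length:
            raise ValueError("sub-range falls outside the slice")
        return SlicePointer(self.server, self.chunk,
                            self.start + offset, self.start + offset + length)


a = SlicePointer("s0", "c1", 1_073_816_936, 8_589_788_476)
b = SlicePointer("s1", "c4", 10_737_389_932, 13_958_442_063)
print(a.length, b.length, a.sub(10, 100))
```

Because the pointer names the server, chunk, and byte range directly, a pointer to part of a slice can be computed without contacting any server.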

SLIDE 11

[Figure: storage server s0 and the file drawn on byte-offset axes (0 MB to 4 MB and 0 MB to 6 MB), both empty, with a cursor marker]

An empty file has no metadata and occupies no space on storage servers

SLIDE 12

[Figure: slice A shown on storage server s0 and in the file. Metadata list: A @ 0 MB]

A 2 MB write writes to the storage servers and metadata

SLIDE 13

[Figure: slices A and B shown on storage server s0 and in the file. Metadata list: A @ 0 MB, B @ 2 MB]

Another 2 MB write

SLIDE 14

[Figure: slices A and B shown on storage server s0 and in the file. Metadata list: A @ 0 MB, B @ 2 MB]

WTF supports writes at arbitrary offsets within files

SLIDE 15

[Figure: slices A, B, and C shown on storage server s0 and in the file. Metadata list: A @ 0 MB, B @ 2 MB, C @ 1 MB]

A 2 MB write that overwrites part of both prior writes

SLIDE 16

Metadata Compaction

Compaction reduces the size of the metadata list by removing references to unused portions of slices

Because slice pointers directly reference the location of files, they can be modified in the metadata list using local computation

Consequently, compaction occurs entirely at the metadata level
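The following sketch shows how compaction could be carried out with arithmetic on metadata alone. It assumes a hypothetical `Entry` record (file offset plus a byte range within a named slice) and an oldest-first metadata list; it is an illustration of the idea, not WTF's compaction routine.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    file_off: int  # where this range lands in the file
    slice_id: str  # which stored slice it refers to
    start: int     # byte range [start, end) within that slice
    end: int


def compact(entries):
    """Return non-overlapping entries describing only the visible bytes."""
    visible = []
    for new in entries:  # oldest first, so later writes win
        new_end = new.file_off + (new.end - new.start)
        kept = []
        for e in visible:
            e_end = e.file_off + (e.end - e.start)
            if e.file_off < new.file_off:  # part left of the new write survives
                keep = min(e_end, new.file_off) - e.file_off
                kept.append(Entry(e.file_off, e.slice_id, e.start, e.start + keep))
            if e_end > new_end:            # part right of the new write survives
                skip = max(e.file_off, new_end) - e.file_off
                kept.append(Entry(e.file_off + skip, e.slice_id, e.start + skip, e.end))
        kept.append(new)
        visible = sorted(kept)
    return visible


# Offsets in MB: A at 0, B at 2, then C at 1 overwriting part of both.
history = [Entry(0, "A", 0, 2), Entry(2, "B", 0, 2), Entry(1, "C", 0, 2)]
compacted = compact(history)
print(compacted)
```

After compaction the list refers only to the first half of A, all of C, and the second half of B. No slice data is read or moved to produce it.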

SLIDE 17

[Figure: slices A, B, and C shown on storage server s0 and in the file. Metadata list: A @ 0 MB, B @ 2 MB, C @ 1 MB]

SLIDE 18

[Figure: slices A, B, and C on storage server s0 and in the file after compaction. Metadata list: A @ 0 MB, B @ 2 MB, C @ 1 MB]

Compaction eliminates references to overwritten or erased data

SLIDE 19

Garbage Collection

Garbage collection cleans up the slices no longer referenced by any slice pointer

WTF periodically scans the filesystem and collects all slice pointers

Storage servers use the scan, along with their local data, to determine which data is garbage
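A storage server's side of this process might look like the sketch below. It assumes the scan yields hypothetical `(chunk, start, end)` tuples and that the server knows the size of each local chunk; the function name and data shapes are invented for illustration, and concerns such as writes that are in flight during the scan are not modeled.

```python
def find_garbage(chunk_sizes, pointers):
    """Return {chunk: [(start, end), ...]} byte ranges that no pointer references.

    chunk_sizes: {chunk: size_in_bytes} for chunks stored on this server.
    pointers:    iterable of (chunk, start, end) collected by the scan.
    """
    live = {chunk: [] for chunk in chunk_sizes}
    for chunk, start, end in pointers:
        if chunk in live:  # ignore pointers to chunks held by other servers
            live[chunk].append((start, end))

    garbage = {}
    for chunk, size in chunk_sizes.items():
        free, pos = [], 0
        for start, end in sorted(live[chunk]):
            if start > pos:  # gap before this referenced range
                free.append((pos, start))
            pos = max(pos, end)
        if pos < size:       # unreferenced tail of the chunk
            free.append((pos, size))
        garbage[chunk] = free
    return garbage


# A 6 MB chunk (offsets in MB) where only [0,1), [3,4) and [4,6) are referenced.
result = find_garbage({"c": 6, "d": 4}, [("c", 0, 1), ("c", 3, 4), ("c", 4, 6)])
print(result)
```

In the example, the range from 1 MB to 3 MB of chunk `c` and all of chunk `d` are reported as garbage.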

SLIDE 20

[Figure: slices A, B, and C on storage server s0 and in the file after compaction. Metadata list: A @ 0 MB, B @ 2 MB, C @ 1 MB]

SLIDE 21

[Figure: slices A, B, and C on storage server s0 and in the file after garbage collection. Metadata list: A @ 0 MB, B @ 2 MB, C @ 1 MB]

Garbage is freed from the underlying filesystem

SLIDE 22

Locality-Aware Slice Placement

Locality-aware slice placement prevents fragmentation when writing sequentially

Slices placed contiguously on storage servers improve locality when reading files

Consistent hashing across storage servers in the system on a per-file basis increases the probability that sequentially written slices are adjacent

The metadata for adjacent slices may be represented in a more compact form
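Per-file consistent hashing can be sketched with a small hash ring. Everything here is an assumption made for illustration: the `Ring` class, the use of SHA-256, the virtual-node count, and hashing the file path alone are not taken from WTF.

```python
import bisect
import hashlib


def _h(key: str) -> int:
    """Map a string to a point on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Hypothetical consistent-hash ring with virtual nodes per server."""

    def __init__(self, servers, vnodes=64):
        self._points = sorted((_h(f"{s}#{i}"), s)
                              for s in servers for i in range(vnodes))
        self._keys = [point for point, _ in self._points]

    def server_for(self, path: str) -> str:
        """Every write to the same file maps to the same server."""
        i = bisect.bisect_right(self._keys, _h(path)) % len(self._points)
        return self._points[i][1]


ring = Ring(["s0", "s1", "s2"])
print(ring.server_for("/logs/a"), ring.server_for("/logs/b"))
```

Because the server is chosen from the file rather than from each individual write, sequential writes to one file land on one server, where they have a chance of being stored next to each other. Removing a server from the ring only remaps the files that hashed to it.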

SLIDE 23

[Figure: slices A and B stored contiguously on storage server s0 and in the file. Metadata list: A @ 0 MB, B @ 2 MB]

Locality-aware slice placement reduces fragmentation

SLIDE 24

[Figure: slices A and B stored contiguously on storage server s0 and in the file. Metadata list: A @ 0 MB, B @ 2 MB, D @ 0 MB]

Slice Pointer A:
    server: s0
    chunk: c
    start: 0 MB
    end: 2 MB

Slice Pointer B:
    server: s0
    chunk: c
    start: 2 MB
    end: 4 MB

Slice Pointer D:
    server: s0
    chunk: c
    start: 0 MB
    end: 4 MB

Adjacent slices may be represented by a new, merged slice pointer
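The merge of A and B into D can be expressed as a short pass over the pointers. This sketch assumes pointers are plain `(server, chunk, start, end)` tuples listed in file order and that neighbouring list entries are also contiguous in the file; the function name is hypothetical.

```python
MB = 2**20


def merge_adjacent(pointers):
    """Merge neighbouring pointers that are contiguous within the same chunk."""
    merged = []
    for server, chunk, start, end in pointers:
        prev = merged[-1] if merged else None
        if prev and prev[:2] == (server, chunk) and prev[3] == start:
            merged[-1] = (server, chunk, prev[2], end)  # extend the previous range
        else:
            merged.append((server, chunk, start, end))
    return merged


a = ("s0", "c", 0 * MB, 2 * MB)
b = ("s0", "c", 2 * MB, 4 * MB)
d = merge_adjacent([a, b])
print(d)
```

With the values from the slide, the two pointers collapse into a single pointer covering 0 MB to 4 MB of chunk c on s0. Pointers on different servers or chunks, or with a gap between them, are left as they are.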

SLIDE 25

[Figure: a single contiguous range on storage server s0 and in the file. Metadata list: D @ 0 MB]

The new slice pointer represents the contiguous range on the storage servers

SLIDE 26

WTF Applications

✦ MapReduce Sort: concat enables an efficient bucket-based merge sort
✦ Work Queue: units of work are appended to the file with append; all contention happens in the metadata layer
✦ Video Editor: yank and paste enable the editor to reorder scenes without rewriting the movie
✦ FUSE Bindings: transactional behavior exposed to the user for easy data exploration

SLIDE 27

Application: MapReduce Sort

[Figure: sort pipeline with stages labeled Input File, Buckets, Sorted Buckets, and Output File]

SLIDE 28

Application: MapReduce Sort

[Figure: sort pipeline with stages labeled Input File, Buckets, Sorted Buckets, and Output File; WTF concat is marked on the diagram]
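The shape of the bucket-based sort can be sketched in a few lines. This is an in-memory stand-in, not the benchmarked job: the function name and the `boundaries` parameter are hypothetical, and the final list join only marks the place where a filesystem-level concat would assemble the output file.

```python
import bisect


def bucket_sort(records, boundaries):
    """Partition records into key ranges, sort each bucket, then join in order."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        buckets[bisect.bisect_right(boundaries, record)].append(record)

    sorted_buckets = [sorted(bucket) for bucket in buckets]

    # Stand-in for concat: buckets cover disjoint, increasing key ranges,
    # so joining them in bucket order yields a fully sorted output.
    return [record for bucket in sorted_buckets for record in bucket]


output = bucket_sort([5, 1, 9, 3, 7], boundaries=[4, 8])
print(output)
```

Because each sorted bucket already covers a disjoint key range, the last step needs no comparison or data movement beyond placing the buckets one after another.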

SLIDE 29

Application: MapReduce Sort

[Figure: MapReduce sort execution time (minutes) for HDFS and WTF]

SLIDE 30

Application: MapReduce Sort

[Figure: execution time (s) of the Bucket, Sort, and Merge phases for HDFS and WTF]

SLIDE 31

Application: Work Queue

[Figure: work queue throughput (ops/s) for HDFS and WTF]

SLIDE 32

Application: Video Editor

[Figure: footage in Chronological Order alongside the Final Cut]

SLIDE 33

Application: Video Editor

[Figure: video editor execution time (s), on a logarithmic axis, for HDFS and WTF]

WTF can rewrite 377 GB of raw movie footage in 16 s using file slicing, an effective rate of about 23 GB/s. Rewriting the same footage using traditional APIs requires approximately three hours.
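The effective rate follows from simple division. The check below assumes decimal gigabytes and treats "approximately three hours" as exactly three hours, so the derived traditional rate and the ratio are rough figures and not measurements from the talk.

```python
GB = 10**9
MB = 10**6

footage_bytes = 377 * GB
slicing_seconds = 16
traditional_seconds = 3 * 3600  # assumption: exactly three hours

effective_gb_per_s = footage_bytes / slicing_seconds / GB
traditional_mb_per_s = footage_bytes / traditional_seconds / MB
speedup = traditional_seconds / slicing_seconds

print(f"file slicing: {effective_gb_per_s:.1f} GB/s effective")
print(f"traditional:  {traditional_mb_per_s:.1f} MB/s")
print(f"ratio:        {speedup:.0f}x")
```

Under these assumptions the file-slicing figure works out to roughly 23.6 GB/s, and the traditional rewrite to roughly 35 MB/s.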

SLIDE 34

Application: Interactive Transactions

# wtf begin-transaction
# ls
./data.0000 ./data.0001 ./data.0002 ./data.0003 ....
# rm -rf *
# ls
# wtf abort-transaction
# ls
./data.0000 ./data.0001 ./data.0002 ./data.0003 ....

SLIDE 35

Microbenchmark: Baseline Performance

[Figure: throughput (MB/s) of POSIX, HDFS, and WTF for Write, Read Seq., and Read Rand. workloads]

SLIDE 36

Microbenchmark: Write Sequential

[Figure: sequential write throughput (MB/s) versus block size, 64 B to 64 MB, for HDFS and WTF]

SLIDE 37

Microbenchmark: Write Sequential

[Figure: sequential write throughput (MB/s) versus block size, 64 B to 1 KB. Series: HDFS]

SLIDE 38

Microbenchmark: Write Sequential

[Figure: sequential write throughput (MB/s) versus block size, 64 B to 1 KB. Series: HDFS, 10ms metadata]

SLIDE 39

Microbenchmark: Write Sequential

[Figure: sequential write throughput (MB/s) versus block size, 64 B to 1 KB. Series: HDFS, WTF, 10ms metadata]

SLIDE 40

Microbenchmark: Write Sequential

[Figure: sequential write throughput (MB/s) versus block size, 64 B to 1 KB. Series: HDFS, WTF, 1ms metadata, 10ms metadata]

SLIDE 41

Microbenchmark: Fault Tolerance

[Figure: WTF throughput (MB/s) over time (s), 0 s to 60 s]

SLIDE 42

Related Work

Distributed Filesystems

Farsite, AFS, xFS, Swift, Petal, Frangipani, NASD, Panasas

Data Center Filesystems

CalvinFS, GFS, HDFS, Salus, Flat Datacenter Storage, Blizzard, f4, Pelican

Transactional Filesystems

QuickSilver, Transactional LFS, Valor, PerDis FS, KBDBFS, Inversion, Amino

SLIDE 43

Conclusion

WTF is a new design point in distributed filesystems that leverages the file slicing abstraction to provide:

✦ Transactional guarantees
✦ Expanded APIs
✦ Improved performance
