


Co-design and Verification of an Available File System

Mahsa Najafzadeh, Marc Shapiro, and Patrick Eugster

File System Replication

[Figure: replicated copies of a file system containing the directories Tool and Pictures]

  • Low latency
  • High availability
  • Fault tolerance

POSIX File Systems vs. Distribution


POSIX:

  • Assumes operations occur in a total order
  • Requires a synchronous, strong consistency model
  • Synchronisation is costly and not available under partition
  • In practice, concurrency conflicts are rare

Distribution:

  • No synchronisation: a replica processes an update locally and propagates its effects to other replicas later
  • Weakens consistency and causes conflicts

Update/Remove Conflict

Conflict example: removing a directory while adding a file into the directory

[Figure: two replicas hold the directories Pictures and Tools. One replica adds the photo IMG_1234.jpg to Pictures while the other concurrently removes Pictures.]


Safety

  • Convergence: do replicas that have delivered the same updates have the same state?
  • Invariant preservation: is the invariant preserved?
  • Sequential: a single operation executed in isolation maintains the invariant
  • Concurrent: concurrent execution maintains the invariant

Tree Invariant

  • Has a fixed root node
  • Root is an ancestor of every node in the tree (reachability)
  • Every named node except the root has exactly one parent
  • No cycle in the directory structure
  • Unique names within a directory
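
The invariant clauses above can be checked mechanically on a small tree. The following is a minimal sketch, not the authors' specification: the node representation (`node id -> (name, parent id)`), the constant `ROOT`, and the function name `satisfies_tree_invariant` are all assumptions made for illustration.

```python
ROOT = "root"  # hypothetical id of the fixed root node


def satisfies_tree_invariant(nodes):
    """Check the tree invariant on `nodes`, a dict: node id -> (name, parent id).

    Storing one parent id per node gives "at most one parent" by construction;
    the checks below cover the remaining clauses.
    """
    # Fixed root: the root exists and has no parent.
    if ROOT not in nodes or nodes[ROOT][1] is not None:
        return False
    seen_names = set()
    for node_id, (name, parent) in nodes.items():
        if node_id == ROOT:
            continue
        # Every non-root node needs an existing parent.
        if parent is None or parent not in nodes:
            return False
        # Unique names within a directory.
        if (parent, name) in seen_names:
            return False
        seen_names.add((parent, name))
        # Reachability and acyclicity: walking up must reach the root
        # without visiting any node twice.
        visited = {node_id}
        current = parent
        while current != ROOT:
            if current in visited:
                return False
            visited.add(current)
            current = nodes[current][1]
            if current is None or current not in nodes:
                return False
    return True


good = {"root": ("/", None), "a": ("A", "root"),
        "b": ("B", "root"), "c": ("C", "a")}
cyclic = {"root": ("/", None), "a": ("A", "c"), "c": ("C", "a")}
duplicate = {"root": ("/", None), "a": ("X", "root"), "b": ("X", "root")}
```

Here `good` satisfies every clause, `cyclic` breaks reachability (A and C only point at each other), and `duplicate` breaks name uniqueness.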


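Example: sequential move operation fails

[Figure: a directory tree with nodes root, A, B, C. The effect ueff of u = mvDir(C,A) is applied, and the invariant I is checked on the state before and the state after.]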

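Example: do not move a directory under itself

[Figure: a directory tree with nodes root, A, B, C. The operation u = mvDir(C,A) is guarded by its precondition uPRE so that the invariant I holds before and after.]

uPRE: C is NOT an ancestor of A, i.e. ¬(C ↓* A)
C ↓* A: A is reachable from C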


Example: concurrent moves fail

[Figure: replicas r1 and r2 both start from the tree root, A, B. One replica executes mvDir(A,B) while the other concurrently executes mvDir(B,A). Each precondition uPRE holds at its origin replica, but once both effects are applied the invariant I is violated (✘).]

mvDirPRE for mvDir(B,A): B is NOT an ancestor of A, i.e. ¬(B ↓* A)
B ↓* A: A is reachable from B
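
The failure above can be replayed in a few lines. This is a minimal sketch under stated assumptions: the tree is a hypothetical `child -> parent` map, `mvDir(x, y)` is taken to mean "make y the parent of x", and the helper names (`is_ancestor`, `mv_dir_pre`, `mv_dir_eff`, `reaches_root`) are illustrative, not taken from the talk.

```python
def is_ancestor(parent, x, y):
    """x ↓* y: y is reachable from x (reflexive), walking up from y."""
    node, seen = y, set()
    while node is not None and node not in seen:
        if node == x:
            return True
        seen.add(node)
        node = parent.get(node)
    return False


def mv_dir_pre(parent, src, dst):
    """Precondition of mvDir(src, dst): src is not an ancestor of dst."""
    return not is_ancestor(parent, src, dst)


def mv_dir_eff(src, dst):
    """Effect of mvDir(src, dst): a function from state to state."""
    return lambda parent: {**parent, src: dst}


def reaches_root(parent, node):
    """True if walking up from node arrives at the root without looping."""
    seen = set()
    while node != "root":
        if node is None or node in seen:
            return False
        seen.add(node)
        node = parent.get(node)
    return True


initial = {"root": None, "A": "root", "B": "root"}

# Each origin replica checks its precondition on the same initial state.
pre_r1 = mv_dir_pre(initial, "A", "B")  # mvDir(A,B) generated at one replica
pre_r2 = mv_dir_pre(initial, "B", "A")  # mvDir(B,A) generated at the other

# Both effects are eventually applied at every replica, in either order.
final_r1 = mv_dir_eff("B", "A")(mv_dir_eff("A", "B")(initial))
final_r2 = mv_dir_eff("A", "B")(mv_dir_eff("B", "A")(initial))
```

Both preconditions hold on `initial` and both replicas converge to the same state, yet in that state A and B are each other's parent and neither is reachable from the root.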

Concurrency Control

Tokens ≈ concurrency control abstractions

  • Tokens = {τ, …}
  • Conflict relation ⋈ ⊆ Tokens × Tokens
  • Example, mutual exclusion tokens: Tokens = {τ}; τ ⋈ τ
  • An operation's generator may acquire a set of tokens
  • Operations associated with conflicting tokens cannot be concurrent
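
A token system of this kind can be sketched as a set of tokens plus a set of conflicting pairs. The code below is illustrative only: the names `conflicts` and `may_run_concurrently` are hypothetical, and treating ⋈ as symmetric is an assumption of this sketch.

```python
from itertools import product


def conflicts(conflict_relation, tokens_u, tokens_v):
    """True if some token of u conflicts (⋈) with some token of v.

    The relation is a set of pairs and is treated as symmetric,
    which is an assumption of this sketch.
    """
    return any(
        (a, b) in conflict_relation or (b, a) in conflict_relation
        for a, b in product(tokens_u, tokens_v)
    )


def may_run_concurrently(conflict_relation, tokens_u, tokens_v):
    """Operations holding conflicting tokens cannot be concurrent."""
    return not conflicts(conflict_relation, tokens_u, tokens_v)


# Mutual exclusion: Tokens = {τ}; τ ⋈ τ
TAU = "τ"
MUTEX = {(TAU, TAU)}
```

With `MUTEX`, two operations that both acquire τ are never concurrent, while an operation that acquires no token is unconstrained.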


Example: moving a directory while updating its content is safe

[Figure: replicas r1 and r2 both start from the tree root, A, B. One replica executes mvDir(B, A) while the other concurrently executes addFile(f,B). Each precondition uPRE holds, and after both effects are delivered the two replicas hold the same tree, including the file f.]

When is Synchronization Necessary?

  • CAP theorem: either (strong) Consistency or Availability, not both, when Partitions occur
  • This is a design trade-off

Our approach:

  • Synchronize (CP) only those operations for which it is strictly necessary for safety
  • Other operations are asynchronous (AP)

Safety = convergence + invariants


Model

Generator (@origin): reads state from one copy and maps operation u to:

  • Precondition: uPRE (safety)
  • Effects: ueff ∈ State ➞ (State ➞ State)
  • Return value: uval ∈ State ➞ Value

Deliver (@all replicas): causally dependent messages are delivered in order

[Figure: a client submits operation u to the origin replica, which checks the precondition uPRE, returns uval to the client and applies ueff. The effect ueff is then delivered to the other replicas (replicas r1, r2, r3 are shown). In a second build, a further operation v is handled the same way and its effect veff is delivered to the other replicas.]
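
The generator/effector split can be sketched as follows. This is a simplified illustration, not the authors' formal model: the state is a hypothetical frozenset of file names, `add_file`, `Generated`, `Replica` and `submit` are made-up names, and delivery is done immediately in a loop rather than asynchronously in causal order.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

State = FrozenSet[str]  # assumption: a state is just a set of file names


@dataclass(frozen=True)
class Generated:
    pre: bool                       # uPRE evaluated at the origin
    val: object                     # uval returned to the client
    eff: Callable[[State], State]   # ueff shipped to every replica


def add_file(name):
    """Hypothetical operation: its generator reads the origin state."""
    def generator(state):
        return Generated(
            pre=True,                    # adding a file is always allowed here
            val=name not in state,       # True if the file is new at the origin
            eff=lambda s: s | {name},    # effect applied at each replica
        )
    return generator


class Replica:
    def __init__(self):
        self.state = frozenset()

    def deliver(self, eff):
        self.state = eff(self.state)


def submit(op, origin, replicas):
    """Run the generator at the origin, then deliver the effect everywhere."""
    generated = op(origin.state)
    if not generated.pre:
        raise ValueError("precondition does not hold at the origin")
    for replica in replicas:  # simplification: synchronous, in-order delivery
        replica.deliver(generated.eff)
    return generated.val


r1, r2, r3 = Replica(), Replica(), Replica()
first = submit(add_file("f"), r1, [r1, r2, r3])
second = submit(add_file("f"), r2, [r1, r2, r3])
```

The return value depends only on the origin's state at generation time, while the effect is a plain state transformer that every replica applies.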

A Mostly-Available, Convergent and Correct File System Design

  • Allows common file system operations to run without synchronization, except for moves
  • Maintains the tree invariant
  • Guarantees convergence using replicated data types [Shapiro+ 2011]
  • Name conflicts: merge directories, rename files
  • Update/Remove conflicts: add-wins directory

Update/Remove Conflict

Add-wins directory: removing a directory while adding a file into the directory

[Figure: two replicas hold the directories Pictures and Tools. One replica adds the photo IMG_1234.jpg to Pictures while the other concurrently removes Pictures. In the resulting state, Pictures, IMG_1234.jpg and Tools are shown.]
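
One common way to obtain add-wins behaviour is an observed-remove set, where a remove only cancels the add tags it has seen. The sketch below illustrates that general technique and is not claimed to be the talk's directory data type; the class `AddWinsSet`, its methods, and the modelling of "add a file into Pictures" as adding both the directory entry and the file path are assumptions.

```python
import itertools


class AddWinsSet:
    """Observed-remove set: a concurrent add survives a remove."""

    _counter = itertools.count()  # process-wide source of unique tag numbers

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.entries = set()  # pairs (name, unique tag)

    def contains(self, name):
        return any(n == name for n, _ in self.entries)

    def prepare_add(self, name):
        tag = (self.replica_id, next(self._counter))
        return ("add", name, frozenset({tag}))

    def prepare_remove(self, name):
        # Only tags observed at this replica are removed.
        observed = frozenset(t for n, t in self.entries if n == name)
        return ("remove", name, observed)

    def apply(self, effect):
        kind, name, tags = effect
        pairs = {(name, t) for t in tags}
        if kind == "add":
            self.entries |= pairs
        else:
            self.entries -= pairs


r1, r2 = AddWinsSet("r1"), AddWinsSet("r2")
created = r1.prepare_add("Pictures")
r1.apply(created)
r2.apply(created)

# Concurrently: r1 removes Pictures, r2 adds a photo into Pictures.
removed = r1.prepare_remove("Pictures")
add_dir = r2.prepare_add("Pictures")
add_photo = r2.prepare_add("Pictures/IMG_1234.jpg")

# Each replica applies its own effects first, then the remote ones.
for effect in (removed, add_dir, add_photo):
    r1.apply(effect)
for effect in (add_dir, add_photo, removed):
    r2.apply(effect)
```

Both replicas end in the same state, and the directory survives because the remove never observed the tag created by the concurrent add.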


CISE Analysis: Proves Application is Correct

  • Rely-Guarantee reasoning for a causally-consistent system with only polynomial complexity
  • Consists of three analysis rules:

  1. Effector Safety: every effect executed in isolation maintains the invariant I (sequential safety)
  2. Commutativity: concurrent operations commute (convergence)
  3. Stability: preconditions are stable under concurrency (concurrent safety)

If satisfied: the invariant I is guaranteed in every possible execution

[Gotsman et al., POPL 2016: 'Cause I'm Strong Enough: Reasoning about Consistency Choices in Distributed Systems]
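
On a tiny state space the three rules can be checked by brute force. The sketch below is a loose, simplified reading of the rules and not the CISE tool or its exact proof obligations: the state space (two directories A and B under a root), the helper names, and the way stability is quantified are all assumptions.

```python
from itertools import product


def all_states():
    """Every parent assignment for directories A and B."""
    for parent_a, parent_b in product(("root", "B"), ("root", "A")):
        yield {"A": parent_a, "B": parent_b}


def is_ancestor(state, x, y):
    """x ↓* y: y is reachable from x (reflexive), walking up from y."""
    node, seen = y, set()
    while node != "root" and node not in seen:
        if node == x:
            return True
        seen.add(node)
        node = state[node]
    return False


def invariant(state):
    """Every directory reaches the root without looping."""
    for directory in state:
        node, seen = directory, set()
        while node != "root":
            if node in seen:
                return False
            seen.add(node)
            node = state[node]
    return True


def mv_dir(src, dst):
    """Return (precondition, effect) for mvDir(src, dst)."""
    return (lambda s: not is_ancestor(s, src, dst),
            lambda s: {**s, src: dst})


def effector_safe(op):
    pre, eff = op
    return all(invariant(eff(s)) for s in all_states()
               if invariant(s) and pre(s))


def commute(u, v):
    return all(u[1](v[1](s)) == v[1](u[1](s)) for s in all_states())


def stable(u, v):
    """Does uPRE still hold after the concurrent effect of v?"""
    return all(u[0](v[1](s)) for s in all_states()
               if invariant(s) and u[0](s) and v[0](s))


move_a_under_b = mv_dir("A", "B")
move_b_under_a = mv_dir("B", "A")
```

In this toy setting each move is safe in isolation and the two effects commute, but the precondition of one move is not stable under the other, which is the case that calls for a concurrency control.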

Effector Safety: Example= move requires precondition

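  • Do not move a directory under itself

[Figure: a directory tree with nodes root, A, B, C. The operation u = mvDir(C,A) has precondition uPRE; its effect ueff takes a state satisfying the invariant I to a state satisfying I.]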

Stability Rule: precondition is stable under concurrent effect

  1. Effector Safety: ueff preserves I when executed in any state satisfying uPRE
  2. Precondition Stability: uPRE will hold when ueff is applied at any replica

[Figure: operation u is generated at replica r1 in a state σ where I and the precondition of u hold, and ueff yields a state satisfying I. At replica r2, a concurrent effect veff is applied before ueff. The question is whether uPRE is preserved after executing v, so that applying ueff at r2 still maintains I.]

Necessary and Sufficient Concurrency Controls for Move

  • Add tokens to avoid mvDir || mvDir
  • A mutually exclusive token for each directory d ∈ Dir: τ(d) ⋈ τ(d)

[Figure: replicas r1 and r2; mvDir(A,B) on a tree with nodes root, A, B and LCA(A,B); directories holding a token are marked T.]

Example: avoid conflicting moves

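[Figure: replicas r1 and r2 both hold the tree root, A, B. mvDir(A,B) acquires {τ(A), τ(B)} and mvDir(B,A) acquires {τ(B), τ(A)}. Since τ(A) ⋈ τ(A) and τ(B) ⋈ τ(B), the two moves cannot run concurrently: one is allowed (✔) and the other is not (✘).]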
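
The token set for a move can be sketched as the source, the destination, and their ancestors below the least common ancestor. This reading is an assumption pieced together from the slides (in particular, whether LCA(A,B) itself carries a token is not stated, and it is excluded here); the function names and the `child -> parent` map are hypothetical.

```python
def ancestors(parent, node):
    """The node followed by its ancestors, ending at the root."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def lca(parent, a, b):
    """Least common ancestor of a and b."""
    on_path_a = set(ancestors(parent, a))
    for node in ancestors(parent, b):
        if node in on_path_a:
            return node
    raise ValueError("nodes are not in the same tree")


def move_tokens(parent, src, dst):
    """Directories whose token mvDir(src, dst) acquires (assumed rule)."""
    top = lca(parent, src, dst)
    tokens = {src, dst}
    for start in (src, dst):
        for node in ancestors(parent, start):
            if node == top:
                break  # assumption: the LCA itself is not locked
            tokens.add(node)
    return tokens


def may_run_concurrently(tokens_u, tokens_v):
    """Each τ(d) conflicts with itself, so shared directories conflict."""
    return tokens_u.isdisjoint(tokens_v)


flat = {"root": None, "A": "root", "B": "root"}
deep = {"root": None, "C": "root", "A": "C", "B": "root"}  # hypothetical tree
```

On the flat tree, mvDir(A,B) and mvDir(B,A) both acquire the tokens of A and B, so they cannot be concurrent; on the deeper tree, moving A under B also takes the token of A's ancestor C.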


Verification Results

Application          #OP  #Tokens  #Invariants  Anomaly           Average Time (ms)
Sequential           7    7        1            NO                278
Concurrent           7             1            safety violation  1297
Fully-Asynchronous   7             1            duplication       2350
Mostly-Asynchronous  7    2        1            NO                1570

Conclusion

  • A rigorous approach for modeling file system behavior for both centralized/synchronous and replicated/asynchronous semantics
  • Common operations, except move, run without concurrency controls
  • A hierarchical least-common-ancestor concurrency control mechanism is necessary and sufficient for move operations

Future Work

  • Translate the move concurrency controls into an efficient implementation
  • Integrate hard links, devices, and mounts into the model
  • Reason about the file system behavior in the presence of failures

Q/A

Backup Slides


Mahsa Najafzadeh, 22/04/16

Removing Token Over Source Directory

[Figure: replicas r1 and r2 hold a tree with nodes root, A, B, C, D, F, H. With the token over the source directory removed, mvDir(A,B) acquires {τ(B), τ(C)} and the concurrent mvDir(A,F) acquires {τ(F)}; these token sets share no directory.]

Removing Token Over Destination Directory

[Figure: replicas r1 and r2 hold a tree with nodes root, A, B, C, D, F, H. With the token over the destination directory removed, mvDir(A,B) acquires {τ(A), τ(C)}; the concurrent mvDir(B,H) is shown with {τ(B), τ(A)}.]

Removing Token Over Ancestors up to LCA

[Figure: replicas r1 and r2 hold a tree with nodes root, A, B, C, D, F, H. With the tokens over ancestors up to the LCA removed, mvDir(A,B) acquires {τ(A), τ(B)} and the concurrent mvDir(C,H) acquires {τ(C), τ(H)}; these token sets share no directory.]

Intuition For Move Tokens

[Figure: replicas r1 and r2; mvDir(A,B) on a tree showing A, B and LCA(A,B).]

Assume that these tokens are not sufficient and that concurrent move operations create a loop through a node, called E:

E ↓ … B ↓ A … ↓ E

Consider the left side of the loop:

E ↓ C … B ↓ A … H ↓ E

The left side implies that one of B's ancestors, called C, concurrently moves to E:

  • mvDir(C,E), with precondition: directory E is not a descendant of C

Now consider the right side of the loop. The right side implies that E concurrently moves to one of A's descendants, called H:

  • mvDir(E,H), which requires tokens over directory H up to LCA(H,E)

Where is LCA(H,E)?

  1) LCA(H,E) is located between A and LCA(A,B): in this case moving E to H requires a token over A, which conflicts with the tokens for moving A to B.
  2) LCA(H,E) is located under A: E is concurrently moved under A, which is not possible because this move operation needs to acquire tokens conflicting with mvDir(A,B).

[Figure: the positions of B, LCA(A,B), A, H and LCA(H,E) in each of the two cases.]


Exploiting More Parallelism

  • Concurrent moves to the same destination directory
  • Conflicting tokens for each directory A ∈ Dir: source token τs(A) and destination token τd(A), with τs(A) ⋈ τd(A)

[Figure: replicas r1 and r2; mvDir(A,B) on a tree with nodes root, A, B and LCA(A,B); directories holding a token are marked T.]