CS 5412/LECTURE 21 FAULT TOLERANCE IN APACHE
Ken Birman Spring, 2020
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2020SP 1
HOW DO APACHE SERVICES HANDLE FAILURE?
We've heard about some of the main tools: Zookeeper, to manage …
CS5412 SPRING 2014 (CLOUD COMPUTING: BIRMAN) 13
And it works no matter what correct consensus protocol you started with.
This makes the result very general.
System starts in S*.
Events can take it to state S1.
Events can take it to state S0.
S* denotes a bivalent state.
S0 denotes a decision 0 state: sooner or later all executions decide 0.
S1 denotes a decision 1 state: sooner or later all executions decide 1.
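The bivalent/univalent vocabulary can be made concrete with a small sketch. The transition graph below is a hypothetical toy (the intermediate states "A" and "B" are invented for illustration and do not come from the slides); the code only shows how a state is classified by the set of decisions still reachable from it.

```python
from typing import Dict, List, Set

# Hypothetical toy graph: each state maps to the states one event can lead to.
# "S0" and "S1" are decision states; "A" and "B" are made-up intermediate states.
TRANSITIONS: Dict[str, List[str]] = {
    "S*": ["A", "B"],
    "A": ["S0"],
    "B": ["S1", "A"],
    "S0": [],
    "S1": [],
}
DECISIONS: Dict[str, int] = {"S0": 0, "S1": 1}


def reachable_decisions(start: str) -> Set[int]:
    """Collect every decision value reachable from `start` (graph search)."""
    seen, stack, found = {start}, [start], set()
    while stack:
        state = stack.pop()
        if state in DECISIONS:
            found.add(DECISIONS[state])
        for nxt in TRANSITIONS[state]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return found


def valence(state: str) -> str:
    """Classify a state by which decisions are still possible from it."""
    decisions = reachable_decisions(state)
    if decisions == {0, 1}:
        return "bivalent"
    if decisions == {0}:
        return "0-valent"
    if decisions == {1}:
        return "1-valent"
    return "no decision reachable"


print(valence("S*"))  # both outcomes still possible
print(valence("A"))   # only decision 0 remains possible
```

In this toy graph, S* and B are bivalent, while A is univalent (0-valent): once the system reaches A, the outcome is fixed even though no process has decided yet.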
System starts in S*. Events can take it to state S1 or to state S0.
e is a critical event that takes us from a bivalent to a univalent state: eventually we’ll “decide” 0.
System starts in S*. Events can take it to state S1 or to state S0.
They delay e and show that there is a situation in which the system will return to a bivalent state, S’*.
System starts in S*. Events can take it to state S1, to state S0, or (with e delayed) to the bivalent state S’*.
In this new state they show that we can deliver e and that now, the new state, S’’*, will still be bivalent!
System starts in S*. Events can take it to state S1, to state S0, or to the bivalent state S’*; delivering e from S’* leads to the bivalent state S’’*.
Notice that we made the system do some work and yet it ended up back in an “uncertain” state.
We can do this again and again.
At some step this run switches from bivalent to univalent, when some process receives some message m.
They now explore executions in which m is delayed.
Initially in a bivalent state.
Delivery of m would make us univalent, but we delay m.
They show that if the protocol is fault-tolerant there must be a run that leads to the …
And they show that you can deliver m in this run without a decision being made.
If this is true once, it is true as often as we like.
In effect: we can delay decisions indefinitely.
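The "again and again" argument can be pictured with a toy adversarial scheduler. This is only a sketch: the graph below is a hypothetical example that simply assumes what FLP proves (every bivalent state has a bivalent successor), so it illustrates the shape of the endless run, not the proof itself. The function names are invented for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy graph in which every bivalent state has a bivalent successor.
TRANSITIONS: Dict[str, List[str]] = {
    "S*": ["S0", "S'"],
    "S'": ["S1", "S''"],
    "S''": ["S0", "S*"],
    "S0": [],
    "S1": [],
}
DECISIONS: Dict[str, int] = {"S0": 0, "S1": 1}


def reachable_decisions(start: str) -> Set[int]:
    """Collect every decision value reachable from `start`."""
    seen, stack, found = {start}, [start], set()
    while stack:
        state = stack.pop()
        if state in DECISIONS:
            found.add(DECISIONS[state])
        for nxt in TRANSITIONS[state]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return found


def adversarial_run(start: str, steps: int) -> List[str]:
    """Always schedule an event that keeps the system bivalent, if one exists."""
    path, state = [start], start
    for _ in range(steps):
        bivalent_next = [
            nxt for nxt in TRANSITIONS[state]
            if reachable_decisions(nxt) == {0, 1}
        ]
        if not bivalent_next:
            break  # the adversary has run out of ways to postpone a decision
        state = bivalent_next[0]
        path.append(state)
    return path


print(adversarial_run("S*", 6))
# ["S*", "S'", "S''", "S*", "S'", "S''", "S*"]
```

However many steps are requested, the run in this toy graph never reaches S0 or S1: the system keeps doing work and stays undecided.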
… too much so for us in CS5412.
So we’ll skip the real details.
Then they allow p to resume execution, but make the system believe that perhaps q has failed.
The original protocol can only tolerate 1 failure, not 2, so it needs to somehow let p rejoin in order to achieve progress.
It computes the right thing.
And it always terminates.
These runs are extremely unlikely (“probability zero”).
Yet they imply that we can’t find a totally correct solution.
And so “consensus is impossible” (“not always possible”).
They assume they have perfect control over which messages the system delivers, and when.
They can pick the exact state in which a message arrives in the protocol.
After all, it is a valid scenario. ... And any valid scenario can happen
A “probability zero” sequence of events.
Yet in a temporal logic sense, FLP shows that if we can prove correctness for a consensus protocol, we’ll be unable to prove it live in a realistic network setting, like a cloud system.
Definitely possible (not even all that hard). Just vote! And we can prove protocols of this kind correct.
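A minimal sketch of the "just vote" idea, assuming a fixed group of n members in which some votes may never arrive. The function name and the strict-majority rule are illustrative assumptions, not the specific protocol the lecture has in mind. The point is the safety half: two different values can never both hold a strict majority of the same n members.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(votes: Sequence[int], n: int) -> Optional[int]:
    """Return the value backed by a strict majority of all n members, else None.

    `votes` holds only the votes that actually arrived; crashed or slow
    members simply contribute nothing.
    """
    counts = Counter(votes)
    for value, count in counts.items():
        if count > n // 2:
            return value
    return None  # no decision yet: safe, but not (yet) live


print(majority_vote([1, 1, 0], n=3))  # 1: two of three members agree
print(majority_vote([1, 0], n=4))     # None: too few votes to decide
```

Returning None when no majority exists keeps the sketch correct at the cost of progress, which is exactly the trade-off FLP is about.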
If our goal is just a probability-one guarantee, we actually can offer a proof of progress.
But in temporal logic settings we want perfect guarantees, and we can’t achieve that goal.
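One way to see what a probability-one guarantee means: assume, purely for illustration, that each round of a protocol reaches a decision independently with probability at least p > 0 (both the round structure and p are assumptions, not something stated in the slides). Then:

```latex
\Pr[\text{undecided after } n \text{ rounds}] \;\le\; (1-p)^{n}
\quad\text{and}\quad
\lim_{n \to \infty} (1-p)^{n} = 0 .
```

Under that assumption the protocol terminates with probability 1, yet the single run in which every round fails to decide is still a valid execution, so termination cannot be asserted for all runs.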
[Figure: a job is mapped to tasks A, B, C, D, E, …, which produce A.out, B.out, C.out, D.out, E.out; a second task B’ also produces B.out.]
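One plausible reading of the figure is that B’ is a re-execution of task B after a failure, regenerating B.out. The sketch below illustrates that reading only; `run_job`, `flaky_b`, and the retry limit are hypothetical names and choices, not part of any Apache API.

```python
from typing import Callable, Dict


def run_job(tasks: Dict[str, Callable[[int], str]],
            max_attempts: int = 3) -> Dict[str, str]:
    """Run each task, re-executing it on failure, and collect its output."""
    outputs: Dict[str, str] = {}
    for name, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                # A rerun overwrites nothing: output is recorded only on success.
                outputs[f"{name}.out"] = task(attempt)
                break
            except RuntimeError:
                continue  # simulated crash: schedule another attempt (B -> B')
        else:
            raise RuntimeError(f"task {name} failed {max_attempts} times")
    return outputs


def flaky_b(attempt: int) -> str:
    """Hypothetical task that crashes on its first attempt only."""
    if attempt == 1:
        raise RuntimeError("simulated crash of task B")
    return "result of B"


outputs = run_job({"A": lambda attempt: "result of A", "B": flaky_b})
print(sorted(outputs))  # ['A.out', 'B.out']
```

The sketch relies on tasks being deterministic and side-effect free, so that a rerun produces the same output the failed attempt would have produced.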