Haryadi S. Gunawi, Pallavi Joshi, Peter Alvaro, Joseph M. Hellerstein, and Koushik Sen
Thanh Do, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
Dhruba Borthakur
Cloud
- Thousands of commodity machines
[Figure: write pipeline among M, C, and nodes 1, 2, 3 (node 4 appears after recovery); stages: Alloc Req, Setup Stage, Data Transfer; failures X1 and X2]
- No failures
- Setup Recovery: recreate fresh pipeline (1, 2, 4)
- Data Transfer Recovery: continue on surviving nodes (1, 2)
- Failures
- FATE
[Figure: Node2, Node3]
I/O information:
  OutputStream.read() in BlockReceiver.java
  <stack trace>
  Net I/O from N3 to N2
  "Data Ack"
Injected failure: Crash After
Failure ID: 2573
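The failure ID above pairs the I/O information with the injected failure type. A minimal sketch of that idea follows, assuming the fields are simply joined and hashed. The class name `FailureID`, its field names, and the use of SHA-1 are hypothetical choices for illustration and are not taken from the slides (which show a numeric ID, 2573).

```python
import hashlib
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class FailureID:
    """Hypothetical record: I/O information plus the injected failure type."""
    io_call: str
    source_file: str
    stack_trace: str
    source_node: str
    dest_node: str
    message: str
    failure_type: str

    def digest(self) -> str:
        # Join all fields and hash them so equal records yield equal IDs.
        joined = "|".join(astuple(self))
        return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:8]


# Values copied from the slide; the ID format itself is an assumption.
fid = FailureID(
    io_call="OutputStream.read()",
    source_file="BlockReceiver.java",
    stack_trace="<stack trace>",
    source_node="N3",
    dest_node="N2",
    message="Data Ack",
    failure_type="Crash After",
)
print(fid.digest())
```

Hashing the whole record means two injections at the same I/O point with different failure types get different IDs, which is what lets experiments be told apart.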
[Figure: failure points A, B, C on the pipeline among M, C, and nodes 1, 2, 3]
1 failure / run: Exp #1: A; Exp #2: B; Exp #3: C
2 failures / run: AB, AC, BC
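The experiment lists above can be sketched as plain combination enumeration. This is only an illustration: the function name `failure_experiments` is hypothetical, and it assumes each experiment is an unordered set of distinct failure points.

```python
from itertools import combinations


def failure_experiments(points, failures_per_run):
    """Return every unordered combination of failure points of the given size."""
    return ["".join(combo) for combo in combinations(points, failures_per_run)]


points = ["A", "B", "C"]
one_failure = failure_experiments(points, 1)   # one experiment per failure point
two_failures = failure_experiments(points, 2)  # one experiment per pair
print(one_failure, two_failures)
```

The number of experiments grows combinatorially with the number of failure points and failures per run, which is why exhaustive exploration becomes expensive.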
- Exercised over 40,000 unique combinations of 1,
[Figure: 2 failures / run over failure points A1, B1, A2, B2, A3, B3]
[Figure: clustered failure points A1, B1, A2, B2, C1, D1, C2, D2]
Test
[It is] great to document (in a spec) the HDFS write protocol ...
…, but we shouldn't spend too much time on it, … a formal spec may be overkill for a protocol we plan to deprecate imminently.
[Figure: Implementation, Specs; failures X1, X2]
expectedNodes (Block, Node): (B, Node 1), (B, Node 2)
actualNodes (Block, Node): (B, Node 1), (B, Node 2)
incorrectNodes (Block, Node): (empty)
[Figure: M, C, nodes 1, 2, 3 during Data Transfer; X marks a failure; B marks a copy of block B]
expectedNodes (Block, Node): (B, Node 1), (B, Node 2)
actualNodes (Block, Node): (B, Node 1)
incorrectNodes (Block, Node): (B, Node 2)
[Figure: M, C, nodes 1, 2, 3; X marks a failure; B marks a copy of block B]
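Read together, the tables above suggest that incorrectNodes holds the expected (Block, Node) pairs that are missing from actualNodes. A minimal sketch under that assumption, using Python sets as stand-ins for the tables (the function name `incorrect_nodes` is hypothetical):

```python
def incorrect_nodes(expected, actual):
    """Assumed reading of the tables: expected pairs not present in actual."""
    return expected - actual


expected = {("B", "Node 1"), ("B", "Node 2")}

# First table set: both replicas are present, so nothing is incorrect.
violations_ok = incorrect_nodes(expected, {("B", "Node 1"), ("B", "Node 2")})

# Second table set: the replica on Node 2 is missing.
violations_bad = incorrect_nodes(expected, {("B", "Node 1")})
print(violations_ok, violations_bad)
```

A non-empty result is the signal that recovery left the system in a state the specification does not allow.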
[Figure: exchange between M and C]
getBlockPipe(…): "Give me 3 nodes for B"
Reply: [Node1, Node2, Node3]
expectedNodes (Block, Node): (B, Node 1), (B, Node 2), (B, Node 3)
[Figure: nodes 1, 2, 3; X marks a failure]
#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT
#1: incorrectNodes(B,N) :- expectedNodes(B,N), NOT
#2: expectedNodes(B,N) :- getBlockPipe(B,N);
#3: expectedNodes(B,N) :- expectedNodes(B,N), fateCrashNode(N), writeStg(B,Stage), Stage == "DataTr"
#4: writeStg(B, "DataTr") :- writeStg(B,"Setup"), nodesCnt(Nc), acksCnt(Ac), Nc==Ac
#5: nodesCnt(B, CNT<N>) :- pipeNodes(B, N);
#6: pipeNodes(B, N) :- getBlockPipe(B, N);
#7: acksCnt(B, CNT<A>) :- setupAcks(B, P, "OK");
#8: setupAcks(B, P, A) :- setupAck(B, P, A);
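Rules #2 and #4 through #8 can be sketched in Python as follows. This is an interpretation, not the authors' engine: events are modeled as sets of tuples, the function name `derive_write_stage` is hypothetical, and rule #7 is assumed to count the distinct nodes that acknowledged with "OK". Rules #1 and #3 are left out.

```python
def derive_write_stage(get_block_pipe, setup_ack, write_stg):
    """Evaluate rules #2 and #4-#8 once over the observed events.

    get_block_pipe: set of (block, node) pairs
    setup_ack:      set of (block, node, ack) triples
    write_stg:      dict mapping block -> current stage
    """
    expected_nodes = set(get_block_pipe)  # rule #2
    pipe_nodes = set(get_block_pipe)      # rule #6
    setup_acks = set(setup_ack)           # rule #8
    stages = dict(write_stg)

    for block in {b for b, _ in pipe_nodes}:
        # Rule #5: number of nodes in the pipeline for this block.
        nodes_cnt = len({n for b, n in pipe_nodes if b == block})
        # Rule #7 (assumed reading): number of nodes that acked "OK".
        acks_cnt = len({p for b, p, a in setup_acks if b == block and a == "OK"})
        # Rule #4: Setup becomes DataTr once every node has acked.
        if stages.get(block) == "Setup" and nodes_cnt == acks_cnt:
            stages[block] = "DataTr"
    return expected_nodes, stages


pipe = {("B", "Node1"), ("B", "Node2"), ("B", "Node3")}
acks = {("B", "Node1", "OK"), ("B", "Node2", "OK"), ("B", "Node3", "OK")}
expected, stages = derive_write_stage(pipe, acks, {"B": "Setup"})
print(stages)
```

Tracking the stage matters because, per the recovery slide, a crash during setup and a crash during data transfer call for different recovery behavior.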
errorSingleRack(B) :- rackCnt(B,Cnt), Cnt==1, blkRacks(B,R), connected(R,Rb), endOfReplicationMonitor(_);
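One way to read the rule above: a block is flagged when all its replicas sit on a single rack, that rack is connected to some other rack, and the replication monitor has finished. The sketch below encodes that reading with plain sets; the function name `error_single_rack` and the boolean `monitor_ended` are hypothetical stand-ins for the rule's predicates.

```python
def error_single_rack(blk_racks, connected, monitor_ended):
    """Return blocks stored on exactly one rack that is connected to another.

    blk_racks:     set of (block, rack) pairs
    connected:     set of (rack, other_rack) pairs
    monitor_ended: True once the replication monitor has finished (assumed flag)
    """
    if not monitor_ended:
        return set()
    errors = set()
    for block in {b for b, _ in blk_racks}:
        racks = {r for b, r in blk_racks if b == block}
        if len(racks) != 1:  # rackCnt(B, Cnt), Cnt == 1
            continue
        (rack,) = racks
        # connected(R, Rb): another rack was reachable from this one.
        if any(r == rack for r, _ in connected):
            errors.add(block)
    return errors


blk_racks = {("B1", "R1"), ("B2", "R1"), ("B2", "R2")}
connected = {("R1", "R2")}
flagged = error_single_rack(blk_racks, connected, monitor_ended=True)
print(flagged)
```

Waiting for the monitor to finish avoids flagging blocks that are only temporarily on one rack while replication is still in progress.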
- Reduce #experiments by an order of magnitude
  - Each experiment = 4-9 seconds
- Found the same number of bugs
  - (by experience)
[Figure: # Exps, Brute Force vs. Pruned, for Write + 2 crashes, Append + 2 crashes, Write + 3 crashes, Append + 3 crashes; labeled values: 7720, 618, 5000]
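To make the cost concrete, the sketch below turns an experiment count into a run-time range using the stated 4-9 seconds per experiment. The counts are the values labeled on the chart, used here only as sample inputs; which bar each value belongs to is not recoverable from the text, and the function name `run_time_bounds` is hypothetical.

```python
def run_time_bounds(num_experiments, min_secs=4, max_secs=9):
    """Return (low, high) total run time in seconds for a batch of experiments."""
    return num_experiments * min_secs, num_experiments * max_secs


for count in (7720, 618, 5000):
    low, high = run_time_bounds(count)
    print(f"{count} experiments: {low / 3600:.1f} to {high / 3600:.1f} hours")
```

At a few seconds per experiment, thousands of experiments take hours, so cutting the count by an order of magnitude changes what is practical to run.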