On the Fault-tolerance and High Performance of Replicated Transactional Systems
Dr. Sachin Hirve, Virginia Tech
10th September 2015
Distributed Operations
In today's world, distributed operations are ubiquitous.
2
tx_start: x = x - 10; y = 20; tx_end
3
– Fault-tolerance
– High resiliency
– Failure masking
4
", ! $,… ,! & ,#also#
5
– State variables that encode the state of the system
– Commands that transform this state

– Ordering layer
– Execution layer
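The bullets above describe replicas as state machines: state variables, commands that transform them, and two layers that order and then execute requests. A minimal sketch of that idea follows. The `Replica` class, the `order` function and the (variable, delta) command shape are illustrative assumptions, not names from the talk, and sorting by request id merely stands in for a real ordering protocol.

```python
class Replica:
    """Execution layer: holds state variables and applies commands to them."""

    def __init__(self):
        self.state = {"x": 100, "y": 0}   # state variables (illustrative values)

    def apply(self, command):
        variable, delta = command          # a command transforms the state
        self.state[variable] += delta


def order(requests):
    """Stand-in for the ordering layer: every replica gets the same sequence.

    A real system would run a consensus protocol here; sorting by request id
    is only an assumption that keeps the sketch deterministic.
    """
    return [cmd for _, cmd in sorted(requests.items())]


requests = {1: ("x", -10), 3: ("y", 20), 2: ("x", 5)}
replicas = [Replica() for _ in range(3)]
for replica in replicas:
    for command in order(requests):
        replica.apply(command)

states = [replica.state for replica in replicas]
```

Because every replica applies the same commands in the same order, all three end in the same state, which is what allows any one of them to mask the failure of another.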
6
[Figure: replicas R-1, R-2 and R-3, each with an ordering layer and an execution layer, connected by the network. Requests Req-1, Req-2 and Req-3 are delivered to every replica in the same order: Req-1, Req-3, Req-2.]
7
[Figure: replicas R-1, R-2 and R-3 connected by the network. The ordering layer delivers Req-1, Req-3, Req-2 in the same order on every replica, and each execution layer processes the requests in that order.]
8
– Consistent state
– High availability
– Failure masking
9
– Replicas define the order of requests "blindly", without looking at conflicts
– Generally, requests are executed serially
– Examples: Paxos [Lamport, 98], Mencius (baseline) [Mao, 08]

– Order is defined only among conflicting requests
– Better possible concurrency for request execution
– Examples: Generalized Paxos [Lamport, 05], EPaxos [Moraru, 13]
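To make the difference concrete, the sketch below lists which pairs of requests an ordering protocol would actually have to sequence if it only ordered conflicting requests. The conflict rule (a shared object with at least one writer) is the usual read/write definition and is an assumption here, as are the request names and their read and write sets.

```python
from itertools import combinations


def conflicts(a, b):
    """Two requests conflict if they share an object and at least one writes it."""
    return bool(a["writes"] & (b["reads"] | b["writes"]) or
                b["writes"] & a["reads"])


requests = {
    "r1": {"reads": {"x"}, "writes": {"y"}},
    "r2": {"reads": set(), "writes": {"x"}},
    "r3": {"reads": {"z"}, "writes": {"z"}},
}

# A total order sequences all three requests; a partial order only has to
# sequence the pairs listed here.
must_be_ordered = [(a, b) for a, b in combinations(sorted(requests), 2)
                   if conflicts(requests[a], requests[b])]
```

Only `r1` and `r2` need a relative order in this example; `r3` touches a disjoint object and could run concurrently with either.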
10
– Requests are executed optimistically prior to order finalization and, at final order, they are validated and committed
– High concurrency and performance for rare conflicts among requests
– Fails to exploit concurrency in high-conflict scenarios

– Requests are executed after the order is finalized
– Requests are executed post final order, therefore conflicts do not lead to aborts
– Fails to benefit from concurrency
11
[Figure: building blocks of the ordering and execution layers: speculation, total order, multi-version objects, optimistic delivery, lightweight commit, partial order, concurrent processing, lock-free execution, deferred update, rule-based routing.]
12
HiperTM: High Performance Fault-Tolerant Transactional Memory [ICDCN 2013]
[Figure: this system mapped onto the building blocks of the ordering and execution layers.]
13
Archie: A Speculative Replicated Transactional System [Middleware 2014]
[Figure: this system mapped onto the building blocks of the ordering and execution layers.]
14
Speculative Client Execution in Deferred Update Replication [MW4NG 2014]
[Figure: this system mapped onto the building blocks of the ordering and execution layers.]
15
Regulating Consensus under the Authority of Caesar (to be submitted to EuroSys 2016)
[Figure: this system mapped onto the building blocks of the ordering and execution layers.]
16
Scaling up Active Replication using Staleness (submitted to TPDS)
[Figure: this system mapped onto the building blocks of the ordering and execution layers.]
17
– These systems are composed of plugins
– Plugins are not specific to a single system or problem
– Plugins can be mixed and matched to create another system that solves a different problem
18
[Figure: the building blocks of "Speculative Client Execution in Deferred Update Replication" (MW4NG 2014) mixed and matched to form "Speculative Client Execution in Deferred Update Replication with partial order".]
19
[Figure: the building blocks of "Regulating Consensus under the Authority of Caesar" (to be submitted to EuroSys 2016) mixed and matched to form "Optimizing query performance under the Authority of Caesar".]
20
– ACM/IFIP/USENIX 15th Middleware Workshop for Next Generation Computing (MW4NG 14)
– To be submitted to EuroSys 16
21
– A transaction executes assuming that all objects it accesses are up to date and that no other concurrent transaction accesses those objects
– Collection of objects and versions that are read by the transaction
– Collection of objects that are updated by the transaction
– Verifying, at commit time, the validity of objects that were read earlier during optimistic execution
– Updating the main memory with the object updates made by the current transaction
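The five definitions above fit together as sketched below: a transaction records what it reads and buffers what it writes, then checks its reads at commit time before installing its updates. The `Store` and `Transaction` classes and the per-object version counter are assumptions made for illustration, not the talk's implementation.

```python
class Store:
    """Shared memory: object name -> (value, version)."""

    def __init__(self, initial):
        self.objects = {name: (value, 0) for name, value in initial.items()}


class Transaction:
    def __init__(self, store):
        self.store = store
        self.read_set = {}    # object -> version observed while executing
        self.write_set = {}   # object -> new value, buffered until commit

    def read(self, name):
        if name in self.write_set:            # read-your-own-write
            return self.write_set[name]
        value, version = self.store.objects[name]
        self.read_set[name] = version
        return value

    def write(self, name, value):
        self.write_set[name] = value

    def validate(self):
        """Check every object read earlier is still at the version observed."""
        return all(self.store.objects[name][1] == version
                   for name, version in self.read_set.items())

    def commit(self):
        """Validate, then install the write-set; returns False on abort."""
        if not self.validate():
            return False
        for name, value in self.write_set.items():
            _, version = self.store.objects[name]
            self.store.objects[name] = (value, version + 1)
        return True


store = Store({"x": 50, "y": 0})
t1, t2 = Transaction(store), Transaction(store)
t1.write("x", t1.read("x") - 10)   # t1 and t2 both read x at version 0
t2.write("y", t2.read("x") + 20)
first = t1.commit()                # installs x = 40, version 1
second = t2.commit()               # t2's read of x is now stale
```

Here `t1` commits and `t2` fails validation, so `t2` would have to be re-executed against the new value of `x`.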
23
– Requests are executed optimistically
– Transaction updates go through a certification phase before they can be committed
[Figure: replicas R-1, R-2 and R-3, each with an ordering layer and an execution layer, connected by the network. Transactions execute locally at the replicas: Tx-1: W(X); Tx-2: R(X), W(Y); Tx-3: W(X); Tx-4: W(Y).]
24
– Defines an order for transaction updates
[Figure: the ordering layer delivers the same sequence of transaction updates to R-1, R-2 and R-3: Tx-2: R(X), W(Y); Tx-1: W(X); Tx-3: W(X); Tx-4: W(Y).]
26
– Validates transaction updates w.r.t. the defined order
– On successful validation, commits the transaction by updating objects
– On failing validation, aborts the transaction and re-executes it
[Figure: every replica certifies the delivered updates in the same order: Tx-2: R(X), W(Y); Tx-1: W(X); Tx-3: W(X); Tx-4: W(Y).]
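The certification step can be sketched as a deterministic loop that every replica runs over the same delivered sequence. The `certify` function, the dictionary layout and the version counters are illustrative assumptions. The delivery order used here is also an assumption: it places Tx-1 before Tx-2, unlike the slide, so that one transaction fails validation.

```python
def certify(delivered, versions):
    """Certify updates in the delivered order; returns (committed, aborted)."""
    committed, aborted = [], []
    for tx in delivered:
        stale = any(versions[obj] != seen for obj, seen in tx["reads"].items())
        if stale:
            aborted.append(tx["id"])       # would be re-executed by its origin
            continue
        for obj in tx["writes"]:
            versions[obj] += 1
        committed.append(tx["id"])
    return committed, aborted


# Every transaction executed against version 0 of X and Y.
delivered = [
    {"id": "Tx-1", "reads": {}, "writes": ["X"]},
    {"id": "Tx-2", "reads": {"X": 0}, "writes": ["Y"]},
    {"id": "Tx-3", "reads": {}, "writes": ["X"]},
    {"id": "Tx-4", "reads": {}, "writes": ["Y"]},
]
outcomes = [certify(delivered, {"X": 0, "Y": 0}) for _ in range(3)]  # 3 replicas
```

All three replicas reach the same verdict without further coordination: Tx-2 read X before Tx-1 overwrote it, so it aborts everywhere, while the blind writes commit.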
27
– Inherent parallelism of transaction processing
– When conflicts among transactions are rare, DUR gives the best performance
– In high-conflict situations, DUR performs poorly due to the high number of aborts
– Even with partitioned access, DUR suffers from aborts among local transactions

– Applicable to certain applications, e.g., TPC-C, an OLTP benchmark
– Can we avoid aborts among local transactions, even in the presence of a higher number of conflicts?
28
– Performance of DUR on various benchmarks and at different contention levels
[Figure: % of aborted transactions on 11 nodes using PaxosSTM]
Contention Level   Accounts   WH    Relations
High               500        23    250
Medium             2000       115   500
Low                5000       230   1000
29
– Local transaction ordering
– Speculation in optimistic execution

– Enforcing local transaction order in the certification phase
30
– A local order is defined among requests
– Speculation helps to pass on the object updates among locally ordered transactions
[Figure: replicas R-1, R-2 and R-3 with ordering and execution layers. Locally ordered transactions forward their updates: Tx-2: R(X), W(Y) turns Y into Y'; Tx-4: W(Y') turns Y' into Y''. The other transactions are Tx-1: W(X) and Tx-3: W(X).]
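The forwarding of Y to Y' and then Y'' among locally ordered transactions can be sketched as follows. The `speculative` buffer, the `run_locally` function and the arithmetic inside each transaction are assumptions chosen only to make the data flow visible; they are not taken from the system presented in the talk.

```python
committed = {"X": 1, "Y": 10}     # last committed values (illustrative)
speculative = {}                  # uncommitted updates of earlier local txs


def read(obj):
    """Prefer the newest speculative version, else the committed one."""
    return speculative.get(obj, committed[obj])


def run_locally(transactions):
    """Execute transactions in their local order, forwarding updates."""
    log = []
    for name, body in transactions:
        updates = body(read)
        speculative.update(updates)         # visible to later local txs
        log.append((name, updates))
    return log


local_order = [
    ("Tx-2", lambda read: {"Y": read("X") + read("Y")}),   # Y  => Y'
    ("Tx-4", lambda read: {"Y": read("Y") * 2}),           # Y' => Y''
]
log = run_locally(local_order)
```

Tx-4 builds on the uncommitted Y' produced by Tx-2 instead of the committed Y, so the two local transactions no longer invalidate each other, provided the certification phase later commits them in the same local order.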
31
– Requests are executed optimistically
– Transaction updates go through a certification phase before they can be committed
[Figure: replicas R-1, R-2 and R-3 with ordering and execution layers. The locally ordered pair Tx-2: R(X), W(Y) (Y => Y') and Tx-4: W(Y') (Y' => Y'') is highlighted alongside Tx-1: W(X) and Tx-3: W(X) in the sequence delivered to each replica.]
32
– Validates transaction updates w.r.t. the defined order
– On successful validation, commits the transaction by updating objects
– On failing validation, aborts the transaction and re-executes it
[Figure: every replica certifies the delivered updates; the locally ordered pair Tx-2: R(X), W(Y) and Tx-4: W(Y') is highlighted alongside Tx-1: W(X) and Tx-3: W(X).]
33
– AMD Opteron 6272, 64-core, 2.1 GHz CPU
– 128 GB RAM and 40 Gbps Ethernet

– Bank: a micro-benchmark that mimics bank operations
– TPC-C: a popular OLTP benchmark
– Vacation: distributed version of the vacation application in STAMP [Minh, 08]

– PaxosSTM: a DUR-based system; it suffers from local aborts
34
Throughput for varying the number
Throughput for 7 nodes with varying number of clients
35
– Transaction length is moderately long
– Even low conflict leads to a high number of aborts for PaxosSTM
Throughput for varying the number
Latency for varying the number of nodes
36
– ACM/IFIP/USENIX 15th Middleware Workshop for Next Generation Computing (MW4NG 14)
– To be submitted to EuroSys 16
37
– Speculation
– Concurrent processing
– Lightweight commit

– In DER, requests have to execute in order, irrespective of conflicts
– In DUR, transactions commit in order, irrespective of conflicts
– Are we losing performance due to total order?
38
– A replica that is elected by all replicas
– Gets the right to propose the order of requests
– Tries to convince other replicas about the proposed order

– Only one elected replica gets to propose the order of requests

– Each replica in the system gets to propose the order of requests

– Number of times a leader has to send messages to finalize the order for a proposed request
39
– Multi-Paxos
– Mencius (baseline) [Mao, 08]
– Generalized Paxos [Lamport, 05]
– EPaxos [Moraru, 13]
40
– Network layer finalizes dependencies for each request
a directed dependency graph
– Local execution layer defines order among conflicting requests
command
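Deriving an execution order from a directed dependency graph generally requires graph analysis, because dependencies between conflicting requests can form cycles. The sketch below uses Tarjan's strongly connected components algorithm, which emits components with dependencies first. It is a generic illustration, not EPaxos's actual code: the `execution_order` function is hypothetical, and breaking ties inside a cycle by command name is an assumption standing in for whatever tie-breaker a real protocol uses.

```python
def execution_order(deps):
    """deps maps each command to the set of commands it depends on."""
    index, low = {}, {}          # discovery index and low-link per command
    on_stack, stack = set(), []
    order = []
    counter = 0

    def visit(cmd):
        nonlocal counter
        index[cmd] = low[cmd] = counter
        counter += 1
        stack.append(cmd)
        on_stack.add(cmd)
        for dep in sorted(deps.get(cmd, ())):
            if dep not in index:
                visit(dep)
                low[cmd] = min(low[cmd], low[dep])
            elif dep in on_stack:
                low[cmd] = min(low[cmd], index[dep])
        if low[cmd] == index[cmd]:           # cmd is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == cmd:
                    break
            order.extend(sorted(component))  # deterministic tie-break in cycles

    for cmd in sorted(deps):
        if cmd not in index:
            visit(cmd)
    return order


# C and D depend on each other (a cycle); both come after A and B.
deps = {"A": set(), "B": {"A"}, "C": {"B", "D"}, "D": {"C"}}
order = execution_order(deps)
```

Every replica that runs this on the same graph obtains the same order, but the traversal has to be repeated as the graph grows, which is the kind of per-command cost the talk attributes to graph processing.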
41
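[Figure: EPaxos message flow among replicas R1 to R5. PreAccept(A) gathers A, {} in all replies and is followed by Commit A, {}. PreAccept(B) gathers differing replies (B, {} and B, {A}), so an Accept(B) round carrying B, {A} and its ACK B replies precedes Commit B, {A}. The reply is sent to the client if the client does not need the result of the execution.]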
42
– If a client waits for the result of an execution then the expensive cost
43
– Multi-leader approach
– Use of a quorum to decide the order
– Finalize the request order in the minimum possible number of communication delays
– Partial order
– Highly concurrent execution of transactions
– Use loosely synchronized clocks to timestamp requests
44
[Figure: time slots 1 to 19 on replicas R0 to R4, holding transactions Ta, Tb, Tc, Td and Te. Burnt slots are marked X: transactions that conflict with Tb cannot be delivered in slot 1. Annotations mark a dependency on Tc and that Td depends on Te.]
45
– Caesar uses naturally advancing physical clocks to timestamp requests
– Caesar forwards the local clock when a timestamp received from another replica is in the future
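The clock rule in the two bullets above can be sketched as follows. The `ReplicaClock` class, the integer clock and the choice to jump exactly to the received timestamp (rather than past it) are assumptions for illustration; the starting values 1 and 5 are borrowed from the slide's timelines.

```python
class ReplicaClock:
    """Logical stand-in for a loosely synchronized physical clock."""

    def __init__(self, now):
        self.now = now

    def tick(self, elapsed=1):
        self.now += elapsed             # the clock advances on its own
        return self.now

    def timestamp(self):
        """Timestamp a new local request with the current clock value."""
        return self.now

    def observe(self, remote_ts):
        """Forward the local clock if a received timestamp is in the future."""
        if remote_ts > self.now:
            self.now = remote_ts
        return self.now


r1, r2 = ReplicaClock(now=1), ReplicaClock(now=5)
ts_a = r2.timestamp()      # request A is stamped 5 at R2
r1.observe(ts_a)           # R1 was at 1, so it jumps forward to 5
r1.tick()
ts_next = r1.timestamp()
```

After observing A, anything R1 stamps later carries a timestamp greater than 5, so a slow clock cannot keep placing new requests before ones it has already seen.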
[Figure: clock timelines of R1 (values 1, 4, 7) and R2 (values 5, 8, 11). A request A, TS=5, {} is exchanged between the replicas and the receiving replica's clock is forwarded accordingly.]
46
[Figure: time slots 1 to 4 holding Ta and Tb. The replica receives Pre-Accept | Tb, 2 from R2. Ta and Tb conflict, so it replies Tb, 2, {Ta} | Ack.]
47
[Figure: time slots 1 to 4 holding Ta and Tb, with a burnt slot marked X. The replica receives Accept | Commit for Tb, 2, {Ta} from R2, records Tb as ACCEPTED {Ta} and replies ACK.]
[Figure: time slots 1 to 4 holding Ta, Tb and Tc. The replica receives Pre-Accept | Tc, 0 from R0. Tb and Tc conflict and Tb may burn slot 0, so the replica waits for Tb's acceptance/stabilization. Tb is ACCEPTED with {Ta, Tc}, and the replica replies Tc, 0, {} | Ack.]
49
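[Figure: time slots 1 to 4 holding Ta, Tb and Tc, with a burnt slot marked X. The replica receives Pre-Accept | Tc, 0 from R0. Tb and Tc conflict and Tb may burn slot 0, so the replica waits for Tb's acceptance/stabilization. Tb is ACCEPTED with {Ta}, and the replica replies Tc, 3, {Tb} | NACK, suggesting a retry at slot 3 for Tc.]
50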
[Figure: time slots 1 to 7 holding Ta, Tb, Tc and Td, with a burnt slot marked X and a COMMIT {Ta} label. The replica receives Pre-Accept | Td, 7 from R2: there is a burnt, conflicting and non-empty slot, so Td waits for Tc annihilation. It also receives Accept | Tc, 5 from R0.]
51
[Figure: nodes N0 to N4 holding the timestamped commands CA|4, CB|5 and CC|6, annotated with waiting conditions W1, W1 and W2.]
52
[Figure: nodes N0 to N4 holding the timestamped commands CA|4, CB|5, CC|6, CD|8 and CE|9, annotated with waiting conditions W2, W1, W1, W1 and W1.]
53
– Both waiting conditions W1 and W2 conflict
– Waiting condition W1 ensures performance
– Waiting condition W2 ensures correctness

– Exchange dependencies in response to the Accept message
54
[Figure: time slots 1 to 7 holding Ta, Tb, Tc and Td, with a burnt slot marked X and a COMMIT {Ta} label. The replica receives Pre-Accept | Td, 7 from R2: there is a burnt, conflicting and non-empty slot, so Td waits for Tc annihilation. It also receives Accept | Tc, 5 from R0. It replies Td, 7, {Tc} and sends Accept-Ack Tc, 5, {}.]
55
[Figure: time slots 1 to 7 holding Ta, Tb, Tc and Td, with a burnt slot marked X and a COMMIT {Ta} label. The replica receives Pre-Accept | Td, 4 from R2 and Accept | Tc, 5 from R0. It replies Td, 4, {Tc} and sends Accept-Ack Tc, 5, {Td}.]
56
[Figure: message flow among replicas R0 to R4.
Pre-Accept(Ta, 2): the replies are Ta, 2, {} | ACK. Same dependencies with all Acks, so the decision is Stable(Ta, 2, {}).
Pre-Accept(Tb, 4): the replies are Tb, 4, {Ta} | ACK and Tb, 4, {} | ACK. Different dependencies with all Acks, so the decision is still taken on the fast path: Commit(Tb, 4, {Ta}).
Execution: execute [Ta], then execute Tb after Ta. Requests execute after their dependencies are executed; no graph analysis is needed.]
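The message flow above can be approximated with the sketch below: a leader that collects only ACKs decides on the fast path with the union of the reported dependencies, and a decided request runs as soon as its dependencies have run. The `decide` and `execute` functions, the reply tuple layout and the handling of a NACK (retrying at the largest suggested timestamp) are assumptions made for illustration and simplify the real protocol considerably.

```python
def decide(tx, timestamp, replies):
    """Combine Pre-Accept replies: (status, suggested_timestamp, dependencies)."""
    if all(status == "ACK" for status, _, _ in replies):
        deps = set().union(*(d for _, _, d in replies))
        return ("fast-path", tx, timestamp, deps)
    retry_at = max(ts for status, ts, _ in replies if status == "NACK")
    return ("retry", tx, retry_at, set())


def execute(decided):
    """Run each transaction once everything it depends on has run."""
    done, pending = [], dict(decided)       # tx -> dependencies
    while pending:
        ready = sorted(tx for tx, deps in pending.items() if deps <= set(done))
        if not ready:
            raise ValueError("unsatisfiable dependencies")
        for tx in ready:
            done.append(tx)
            del pending[tx]
    return done


ta = decide("Ta", 2, [("ACK", 2, set()), ("ACK", 2, set())])
tb = decide("Tb", 4, [("ACK", 4, {"Ta"}), ("ACK", 4, set())])
order = execute({ta[1]: ta[3], tb[1]: tb[3]})
```

Tb is decided with {Ta} even though the replicas reported different dependency sets, and execution only needs to wait for Ta. The sketch assumes that dependencies always point to requests with smaller timestamps, so they cannot form cycles and no graph traversal is required.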
57
– AMD Opteron 6272, 64-core, 2.1 GHz CPU
– 128 GB RAM and 40 Gbps Ethernet

– Key-Value: a micro-benchmark that does single-object reads/writes

– TPC-C: a popular OLTP benchmark
– Vacation: distributed version of the vacation application in STAMP [Minh, 08]

– Multi-Paxos: total order, post-final-delivery serial execution
– Mencius: multi-leader total order, post-final-delivery serial execution
– EPaxos: multi-leader partial order, post-final-delivery parallel processing
58
– The performance of NG-EPaxos, i.e., EPaxos without graph processing, confirms the high cost of graph processing
Ordering layer performance with varying number of nodes
59
Ordering layer performance for 11 nodes and varying number of conflicting clients per object
60
TPC-C transaction throughput for varying number
TPC-C transaction throughput for varying number
61
– Different contributions can be mixed and matched to solve another set of problems in distributed transaction processing
– Both DER and DUR can benefit
– Optimistic order helps speculation; partial order helps concurrent processing
62
– ICDCN 14
– Submitted to TCS
– Middleware 13
– Middleware 14
– MW4NG 14
– To be submitted to EuroSys 16
– Submitted to TPDS
– TPDS 15
– Cluster 13
List of Contributions
63