Scaling State Machine Replication, by Fernando Pedone, University of Lugano (USI), Switzerland


SLIDE 1

Scaling State Machine Replication

Fernando Pedone University of Lugano (USI) Switzerland

SLIDE 2

State machine replication

  • Fundamental approach to fault tolerance

✦ Google Spanner
✦ Apache Zookeeper
✦ Windows Azure Storage
✦ MySQL Group Replication
✦ Galera Cluster, …

SLIDE 3

State machine replication is intuitive & simple

  • Replication transparency

✦ For clients
✦ For application developers

  • Simple execution model

✦ Replicas order all commands
✦ Replicas execute commands deterministically and in the same order
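The execution model can be sketched with a toy key-value state machine (all names here are illustrative, not from the talk): since replicas apply the same ordered log of commands deterministically, their states cannot diverge.

```python
# Minimal sketch of the SMR execution model (illustrative names).

class Replica:
    def __init__(self):
        self.state = {}

    def execute(self, command):
        # Commands must be deterministic: same state + same command
        # always yields the same next state.
        op, key, value = command
        if op == "put":
            self.state[key] = value
        return self.state.get(key)

# All replicas receive the same totally ordered command log...
log = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]

r1, r2 = Replica(), Replica()
for cmd in log:
    r1.execute(cmd)
    r2.execute(cmd)

# ...so their states are identical after applying it.
assert r1.state == r2.state == {"x": 3, "y": 2}
```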

SLIDE 4

Configurable fault tolerance but bounded performance

  • Performance is bounded by what one replica can do

✦ Every replica needs to execute every command
✦ More replicas: same (if not worse) performance

[Figure: throughput vs. number of servers.]

How to scale state machine replication?

SLIDE 5

Scaling performance with partitioning

  • Partitioning (aka sharding) application state

Problem #1: How to order commands in a partitioned system?
Problem #2: How to execute commands in a partitioned system?

Scalable performance (for single-partition commands)

[Figure: throughput vs. number of servers, with state split across partitions Px and Py.]

SLIDE 6

Ordering commands in a partitioned system

  • Atomic multicast

✦ Commands addressed (multicast) to one or more partitions
✦ Commands ordered within and across partitions

  • If S delivers C before C’, then no S’ delivers C’ before C

[Figure: partitions Px and Py deliver commands C(x), C(y), and C(x,y); the stack is Scalable SMR over atomic multicast (Multi-Paxos) over the network.]
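The delivery property can be expressed as a small checker over per-partition delivery logs (an illustrative sketch for intuition, not part of any multicast protocol):

```python
from itertools import combinations

def violates_atomic_order(deliveries):
    """Check the pairwise atomic-multicast property on per-partition
    delivery logs: if some partition delivers C before C', no other
    partition delivers C' before C.  (Illustrative checker only.)"""
    for log_a, log_b in combinations(deliveries.values(), 2):
        common = set(log_a) & set(log_b)
        for c, c2 in combinations(common, 2):
            in_a = log_a.index(c) < log_a.index(c2)
            in_b = log_b.index(c) < log_b.index(c2)
            if in_a != in_b:
                return True
    return False

# C(x,y) is delivered by both partitions, consistently ordered.
ok  = {"Px": ["C(x)", "C(x,y)"], "Py": ["C(y)", "C(x,y)"]}
# Two multi-partition commands delivered in opposite orders: violation.
bad = {"Px": ["C(x,y)", "D(x,y)"], "Py": ["D(x,y)", "C(x,y)"]}
assert not violates_atomic_order(ok)
assert violates_atomic_order(bad)
```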

SLIDE 7

Executing multi-partition commands

Solution #1: Static partitioning of data
Solution #2: Dynamic partitioning of data

[Figure: partition X holds objects x, partition Y holds objects y; command C(x,y): { x := y } involves both.]

SLIDE 8

Solution 1: Static partitioning of data

  • Execution model

✦ Client queries location oracle to determine partitions
✦ Client multicasts command to involved partitions
✦ Partitions exchange and temporarily store objects needed to execute multi-partition commands
✦ Commands executed by all involved partitions

  • Location oracle

✦ Simple implementation thanks to static scheme
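The client side of this model can be sketched as follows (illustrative: the oracle is a fixed object-to-partition map, and atomic multicast is abstracted as appending to per-partition queues; none of these names come from the actual system):

```python
# Client-side sketch of the static scheme (illustrative names).
# Static partitioning makes the location oracle a fixed map.
ORACLE = {"x": "Px", "y": "Py"}

def involved_partitions(command_objects):
    # Query the oracle once per object; the set of owning
    # partitions is the command's destination group.
    return sorted({ORACLE[obj] for obj in command_objects})

def multicast(command, partitions, network):
    # Atomic multicast abstracted as per-partition delivery queues.
    for p in partitions:
        network[p].append(command)

network = {"Px": [], "Py": []}
multicast("C(x,y): x := y", involved_partitions(["x", "y"]), network)

# Both involved partitions deliver the multi-partition command.
assert network == {"Px": ["C(x,y): x := y"], "Py": ["C(x,y): x := y"]}
```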


SLIDE 9

How to execute multi-partition commands?

[Figure: command C(x,y): x := y; each partition holds its own objects plus temporarily cached entries from the other partition.]

SLIDE 10

Static scheme, step-by-step

[Flowchart, reconstructed as text:]

Client: start → query oracle → multicast command to involved partitions → receive result → end

Server: deliver command → all local objects?
  • Yes → execute command → send result
  • No → send needed objects/signal to remote partitions; wait for objects/signal from remote partitions → execute command → send result
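The server-side object exchange can be simulated in one process (the stores, inboxes, and the command below are all illustrative assumptions): each involved partition mails the objects it owns to its peers, builds a view from local plus cached entries, and executes the command deterministically.

```python
# Single-process simulation of the static scheme's multi-partition
# execution for C(x,y): x := y (all data illustrative).

stores   = {"Px": {"x": 0}, "Py": {"y": 7}}
inboxes  = {"Px": [], "Py": []}
involved = ["Px", "Py"]

# Step 1: each partition sends the objects it owns to its peers.
for p in involved:
    for obj, val in stores[p].items():
        for peer in involved:
            if peer != p:
                inboxes[peer].append((obj, val))

# Step 2: each partition executes on local + temporarily cached
# remote objects; only the owner of x installs the new value.
for p in involved:
    view = dict(stores[p])
    view.update(dict(inboxes[p]))   # cached entries from peers
    view["x"] = view["y"]           # execute C(x,y): x := y
    if "x" in stores[p]:
        stores[p]["x"] = view["x"]

assert stores["Px"]["x"] == 7       # x now holds y's value
```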

SLIDE 11

Solution 2: Dynamic partitioning of data

  • Execution model (key idea)

✦ Turn every command single-partition
✦ If command involves multiple partitions, move objects to a single partition before executing command

  • Location oracle

✦ Oracle implemented as a “special partition”
✦ Move operations involve oracle, source and destination partitions
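A minimal sketch of the move-then-execute idea, assuming a toy oracle map and in-process stores (all names and values are illustrative):

```python
# Dynamic scheme sketch: make C(x,y) single-partition by moving
# its objects to one partition first (illustrative data).

oracle = {"x": "Px", "y": "Py"}      # the "special partition"
stores = {"Px": {"x": 0}, "Py": {"y": 7}}

def move(obj, dst):
    # A move involves the oracle (location update), the source,
    # and the destination partitions.
    src = oracle[obj]
    if src != dst:
        stores[dst][obj] = stores[src].pop(obj)
        oracle[obj] = dst

# C(x,y): x := y -- first gather both objects at Px...
for obj in ("x", "y"):
    move(obj, "Px")

# ...then execute as a plain single-partition command.
stores["Px"]["x"] = stores["Px"]["y"]

assert oracle == {"x": "Px", "y": "Px"}
assert stores["Px"]["x"] == 7
```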


SLIDE 12

Dynamic scheme, step-by-step

[Flowchart, reconstructed as text:]

Client: start → query oracle → one partition?
  • No → move objects to one partition, then multicast command to partition
  • Yes → multicast command to partition
Client: receive result → result = retry?
  • Yes → retry? (Yes → query oracle again; No → end)
  • No → end

Server: deliver command → all local objects?
  • Yes → execute command → send result
  • No → result = retry → send result

SLIDE 13

Termination and load balance

  • Ensuring termination of commands

✦ After retrying n times, command is multicast to all partitions
✦ Executed as a multi-partition command

  • Ensure load balancing among partitions

✦ Target partition in multi-partition command chosen randomly
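The termination rule can be sketched as follows (illustrative: `MAX_RETRIES`, the partition names, and the `try_execute` callback are assumptions for this sketch, not the system's API): bounded retries with a random target, then a guaranteed multi-partition fallback.

```python
import random

# Sketch of the termination rule: after MAX_RETRIES failed
# single-partition attempts, fall back to multicasting the command
# to all partitions (illustrative names throughout).

MAX_RETRIES = 3
PARTITIONS = ["Px", "Py"]

def submit(command, try_execute):
    for _ in range(MAX_RETRIES):
        # Random target balances load among partitions.
        target = random.choice(PARTITIONS)
        if try_execute(command, [target]):
            return "single-partition"
    # Fallback: execute as a multi-partition command; this
    # guarantees the command eventually terminates.
    try_execute(command, PARTITIONS)
    return "multi-partition"

# A command whose single-partition attempts always fail (e.g. its
# objects keep moving) still terminates via the fallback.
always_fail = lambda cmd, parts: len(parts) == len(PARTITIONS)
assert submit("C(x,y)", always_fail) == "multi-partition"
```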


SLIDE 14

Oracle: high availability and performance

  • Oracle implemented as a partition

✦ For fault tolerance

  • Clients cache oracle entries

✦ For performance
✦ Real oracle needed at first access and when objects change location
✦ Client retries command if cached location is stale
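A sketch of the client-side cache, assuming a plain dictionary stands in for the oracle partition (the class and method names are hypothetical): the real oracle is contacted only on first access or after a stale hit.

```python
# Client-side oracle caching sketch (illustrative names).

class CachingClient:
    def __init__(self, oracle):
        self.oracle = oracle        # stands in for the oracle partition
        self.cache = {}
        self.oracle_queries = 0

    def locate(self, obj):
        # Real oracle consulted only on first access.
        if obj not in self.cache:
            self.oracle_queries += 1
            self.cache[obj] = self.oracle[obj]
        return self.cache[obj]

    def on_stale(self, obj):
        # Server answered "retry": drop the stale entry, re-query.
        del self.cache[obj]
        return self.locate(obj)

oracle = {"x": "Px"}
client = CachingClient(oracle)
assert client.locate("x") == "Px"
assert client.locate("x") == "Px"   # served from cache
assert client.oracle_queries == 1

oracle["x"] = "Py"                  # object moved partitions
assert client.on_stale("x") == "Py"
assert client.oracle_queries == 2
```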


SLIDE 15

Dynamically (re-)partitioning the state

  • Decentralized strategy

✦ Client chooses one partition among involved partitions
✦ Each move involves oracle and concerned partitions
✦ No single entity has complete system knowledge
✦ Good performance with strong locality, but…
✦ …slow convergence
✦ Poor performance with weak locality

SLIDE 16

Dynamically (re-)partitioning the state

  • Centralized strategy

✦ Oracle builds graph of objects and relations (commands)
✦ Oracle partitions O-R graph (METIS) and requests move operations to place all objects in one partition
✦ Near-optimum partitioning (both strong and weak locality)
✦ Fast convergence
✦ Oracle knows location of and relations among objects
✦ Oracle solves a hard problem
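The oracle's input can be illustrated without METIS itself: build the object-relation graph from a command workload and measure the edge cut of a candidate placement (the workload and placements below are made up; METIS would compute the placement, here we only evaluate it):

```python
from itertools import combinations

# Object-relation graph sketch: an edge links two objects accessed
# by the same command (illustrative workload).
commands = [("a", "b"), ("b", "c"), ("d", "e")]

edges = set()
for cmd in commands:
    for u, v in combinations(sorted(cmd), 2):
        edges.add((u, v))

def edge_cut(placement):
    # Cut edges cross partitions: those commands stay
    # multi-partition (or force move operations).
    return sum(1 for u, v in edges if placement[u] != placement[v])

good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}   # respects relations
bad  = {"a": 0, "b": 1, "c": 0, "d": 0, "e": 1}   # splits every command
assert edge_cut(good) == 0
assert edge_cut(bad) == 3
```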


SLIDE 17

Social network application (similar to Twitter)

  • GetTimeline

✦ Single-object command => always involves one partition

  • Post

✦ Multi-object command => may involve multiple partitions
✦ Strong locality
  • 0% edge cut, social graph can be perfectly partitioned
✦ Weak locality
  • 1% and 5% of edge cuts, after partitioning social graph
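The two command types can be sketched against a toy placement (the follower graph, partition assignment, and object model below are assumptions for illustration): GetTimeline touches one user's objects, so it is always single-partition; Post writes to every follower's timeline, so it spans as many partitions as the followers do.

```python
# Workload sketch (illustrative data and object model).

followers = {"alice": ["bob", "carol"]}
placement = {"alice": "Px", "bob": "Px", "carol": "Py"}

def partitions_for_get_timeline(user):
    # Single-object command: reads one user's timeline.
    return {placement[user]}

def partitions_for_post(user):
    # Multi-object command: writes the poster's and every
    # follower's timeline.
    return {placement[u] for u in [user] + followers[user]}

assert partitions_for_get_timeline("alice") == {"Px"}   # always one
assert partitions_for_post("alice") == {"Px", "Py"}     # may be many
```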


SLIDE 18

GetTimelines only (single-partition commands)

[Figure: throughput (kcps) vs. number of partitions (1, 2, 4, 8), 0% edge-cut. Series: SMR = classic SMR, SSMR = static, DSSMR = dynamic decentralized, DSSMRv2 = dynamic centralized, SSMRMetis = optimized static.]

All schemes scale! (by design)

SLIDE 19

Posts only, strong locality (0% edge cut)

[Figure: throughput (kcps) vs. number of partitions (1, 2, 4, 8), 0% edge-cut. Series: SMR = classic SMR, SSMR = static, DSSMR = dynamic decentralized, DSSMRv2 = dynamic centralized, SSMRMetis = optimized static.]

The dynamic schemes and the optimized static scheme scale, but the plain static scheme does not

SLIDE 20

Posts only, weak locality (1% edge cut)

[Figure: throughput (kcps) vs. number of partitions (1, 2, 4, 8), 1% edge-cut. Series: SMR = classic SMR, SSMR = static, DSSMR = dynamic decentralized, DSSMRv2 = dynamic centralized, SSMRMetis = optimized static.]

Only the optimized static and centralized dynamic schemes scale

SLIDE 21

Conclusions

  • Scaling State Machine Replication

✦ Possible but locality is fundamental

  • OSs and DBs have known this for years

✦ Replication and partitioning transparency

  • The future ahead

✦ Decentralized schemes with quality of centralized schemes
✦ Expand scope of applications (e.g., data structures)
✦ “The inherent limits of scalable state machine replication”

SLIDE 22

THANK YOU!!!

More details: http://www.inf.usi.ch/faculty/pedone/scalesmr.html

Joint work with… Long Hoang Le, Enrique Fynn, Eduardo Bezerra, Robbert van Renesse