Orchestrator on Raft: internals, benefits and considerations - PowerPoint PPT Presentation



SLIDE 1

Orchestrator on Raft: internals, benefits and considerations

Shlomi Noach GitHub PerconaLive 2018

SLIDE 2

About me

@github/database-infrastructure Author of orchestrator, gh-ost, freno, ccql and others. Blog at http://openark.org @ShlomiNoach

SLIDE 3

Agenda

Raft overview
Why orchestrator/raft
orchestrator/raft implementation and nuances
HA, fencing
Service discovery
Considerations

SLIDE 4

Raft

Consensus algorithm Quorum based In-order replication log Delivery, lag Snapshots
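To make the "quorum based" point concrete, here is a minimal Go sketch (not part of the deck) of raft's majority arithmetic: an n-node cluster needs a strict majority of floor(n/2)+1 to elect a leader or commit a log entry, and therefore tolerates floor((n-1)/2) node failures.

package main

import "fmt"

// quorum returns the strict majority required to elect a leader
// or commit a log entry in an n-node raft cluster.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many nodes may fail while a quorum survives.
func tolerated(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("%d nodes: quorum=%d, tolerated failures=%d\n", n, quorum(n), tolerated(n))
	}
	// Output:
	// 3 nodes: quorum=2, tolerated failures=1
	// 5 nodes: quorum=3, tolerated failures=2
}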

SLIDE 5

HashiCorp raft

golang raft implementation Used by Consul Recently hit 1.0.0 github.com/hashicorp/raft
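For orientation, a minimal sketch of assembling a single node with the hashicorp/raft library. The node ID, bind address and in-memory stores are illustrative placeholders, and this is not orchestrator's actual wiring: orchestrator persists the raft log in its backend database, and cluster bootstrapping/joining is omitted here.

package main

import (
	"io"
	"os"
	"time"

	"github.com/hashicorp/raft"
)

// noopFSM is a placeholder finite state machine; a real application
// applies each committed log entry to its own state in Apply().
type noopFSM struct{}

func (noopFSM) Apply(l *raft.Log) interface{}       { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (noopFSM) Restore(rc io.ReadCloser) error      { return rc.Close() }

func main() {
	config := raft.DefaultConfig()
	config.LocalID = raft.ServerID("node1") // illustrative node ID

	// In-memory log/stable/snapshot stores keep the example self-contained.
	logStore := raft.NewInmemStore()
	stableStore := raft.NewInmemStore()
	snapshots := raft.NewInmemSnapshotStore()

	transport, err := raft.NewTCPTransport("127.0.0.1:10008", nil, 3, 10*time.Second, os.Stderr)
	if err != nil {
		panic(err)
	}

	r, err := raft.NewRaft(config, noopFSM{}, logStore, stableStore, snapshots, transport)
	if err != nil {
		panic(err)
	}
	_ = r // r.Apply(), r.State(), r.Leader() etc. drive the node from here
}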

SLIDE 6
Orchestrator

MySQL high availability solution and replication topology manager Developed at GitHub Apache 2 license github.com/github/orchestrator

SLIDE 7

Why orchestrator/raft

Remove MySQL backend dependency. DC fencing. And then good things happened that were not planned: better cross-DC deployments, DC-local KV control, Kubernetes friendly.

SLIDE 8
Orchestrator/raft

n orchestrator nodes form a raft cluster. Each node has its own, dedicated backend database (MySQL or SQLite). All nodes probe the topologies. All nodes run failure detection. Only the leader runs failure recoveries.
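The "everyone detects, only the leader acts" split can be sketched roughly like this in Go. detectFailures and recoverFailure are hypothetical stand-ins for orchestrator's detection and recovery logic, but the leadership check against raft.Leader is the real hashicorp/raft API.

package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// failure is a hypothetical record of a detected topology problem.
type failure struct{ cluster string }

// detectFailures and recoverFailure stand in for orchestrator's real
// probing/detection and recovery code.
func detectFailures() []failure { return nil }
func recoverFailure(f failure)  {}

// detectionLoop runs on every node; recovery is gated on raft leadership.
func detectionLoop(r *raft.Raft) {
	for range time.Tick(time.Second) {
		failures := detectFailures() // every node probes and detects

		if r.State() != raft.Leader {
			continue // followers detect, but do not act
		}
		for _, f := range failures {
			recoverFailure(f) // e.g. promote a replica to be the new master
		}
	}
}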

SLIDE 9

Implementation & deployment @ GitHub

5 nodes (2xDC1, 2xDC2, 1xDC3). 1 second raft polling interval. step-down, raft-yield. SQLite-backed log store. MySQL backend (SQLite backend use case in the works).

SLIDE 10

A high availability scenario

o2 is the leader of a 3-node orchestrator/raft setup (nodes o1, o2, o3).

SLIDE 11

Injecting failure

master: killall -9 mysqld

o2 detects failure. About to recover, but…
SLIDE 12

Injecting 2nd failure

o2: DROP DATABASE orchestrator;
o2 freaks out. 5 seconds later it steps down.

SLIDE 13
Orchestrator recovery

o1 grabs leadership
SLIDE 14

MySQL recovery

o1 detected the failure even before stepping up as leader.
o1, now leader, kicks recovery, fails over the MySQL master.

SLIDE 15
Orchestrator self-health tests

Meanwhile, o2 panics and bails out.

SLIDE 16

puppet

Some time later, puppet kicks the orchestrator service back up on o2.
SLIDE 17
Orchestrator startup

The orchestrator service on o2 bootstraps, creates the orchestrator schema and tables.

SLIDE 18

Joining raft cluster

o2 recovers from raft snapshot, acquires raft log from an active node, rejoins the group.
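How a rejoining node rebuilds state is defined by the FSM it plugs into hashicorp/raft: Restore() loads the latest snapshot, and Apply() then replays the remaining committed log entries, in order. Below is a deliberately tiny, illustrative FSM whose whole state is a counter; orchestrator's real FSM instead applies writes to its backend database.

package example

import (
	"encoding/binary"
	"io"
	"sync/atomic"

	"github.com/hashicorp/raft"
)

type counterFSM struct{ value uint64 }

// Apply is called once per committed log entry, in log order.
func (f *counterFSM) Apply(l *raft.Log) interface{} {
	return atomic.AddUint64(&f.value, 1)
}

// Snapshot captures current state so nodes that are far behind
// (or freshly reprovisioned) don't need the full log.
func (f *counterFSM) Snapshot() (raft.FSMSnapshot, error) {
	return &counterSnapshot{value: atomic.LoadUint64(&f.value)}, nil
}

// Restore replaces the FSM state from a snapshot stream.
func (f *counterFSM) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	var buf [8]byte
	if _, err := io.ReadFull(rc, buf[:]); err != nil {
		return err
	}
	atomic.StoreUint64(&f.value, binary.BigEndian.Uint64(buf[:]))
	return nil
}

type counterSnapshot struct{ value uint64 }

// Persist writes the snapshot to the sink provided by the raft library.
func (s *counterSnapshot) Persist(sink raft.SnapshotSink) error {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], s.value)
	if _, err := sink.Write(buf[:]); err != nil {
		sink.Cancel()
		return err
	}
	return sink.Close()
}

func (s *counterSnapshot) Release() {}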

SLIDE 19

Grabbing leadership

Some time later, o2 grabs leadership

SLIDE 20

DC fencing

Assume this 3-DC setup: one orchestrator node in each DC; master and a few replicas in DC2. What happens if DC2 gets network partitioned, i.e. no network in or out of DC2?

SLIDE 21

DC fencing

From the point of view of DC2 servers, and in particular from the point of view of DC2's orchestrator node: the master and replicas are fine, and DC1 and DC3 servers are all dead, so there is no need for failover. However, DC2's orchestrator is not part of a quorum, hence not the leader. It doesn't call the shots.

SLIDE 22

DC fencing

In the eyes of either DC1's or DC3's orchestrator: all DC2 servers, including the master, are dead. There is need for failover. DC1's and DC3's orchestrator nodes form a quorum. One of them will become the leader. The leader will initiate failover.

SLIDE 23

DC fencing

Depicted is a potential failover result: the new master is from DC3.

SLIDE 24
Orchestrator/raft & Consul

orchestrator is Consul-aware

Upon failover, orchestrator updates Consul KV with the identity of the promoted master

Consul @ GitHub is DC-local, no replication between Consul setups

orchestrator nodes update Consul locally in each DC
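A rough sketch of that Consul update using the official Go client (github.com/hashicorp/consul/api). The function name and the mysql/master/<cluster> key layout are illustrative assumptions, not necessarily orchestrator's exact keys; KV().Put against the local agent is the real API, and talking only to the local agent is what keeps the update DC-local when Consul is not replicated across DCs.

package example

import (
	"fmt"

	consulapi "github.com/hashicorp/consul/api"
)

// announcePromotedMaster writes the promoted master's address into the
// local Consul agent's KV store (hypothetical helper, illustrative key layout).
func announcePromotedMaster(clusterAlias, masterHost string, masterPort int) error {
	// DefaultConfig talks to the local agent (127.0.0.1:8500), i.e. the
	// update stays within this DC's Consul setup.
	client, err := consulapi.NewClient(consulapi.DefaultConfig())
	if err != nil {
		return err
	}
	kv := &consulapi.KVPair{
		Key:   fmt.Sprintf("mysql/master/%s", clusterAlias),
		Value: []byte(fmt.Sprintf("%s:%d", masterHost, masterPort)),
	}
	_, err = client.KV().Put(kv, nil)
	return err
}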
SLIDE 25

Considerations, watch out for

Eventual consistency is not always your best friend. What happens if, upon replay of the raft log, you hit two failovers for the same cluster? NOW() and otherwise time-based assumptions. Reapplying snapshot/log upon startup.
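One way to defuse the NOW()/replay issues, sketched below with hypothetical types (not orchestrator's code): have each raft command carry the timestamp captured when it was created rather than calling time.Now() at apply time, and make applying a command idempotent, so that replaying the log, or meeting a second failover entry for the same cluster, does not trigger the recovery twice.

package example

import "time"

// failoverCommand is a hypothetical raft log entry payload.
type failoverCommand struct {
	Cluster   string
	NewMaster string
	IssuedAt  time.Time // captured when the command was created, not when it is replayed
}

// fsmState is a hypothetical slice of applied state.
type fsmState struct {
	lastFailover map[string]time.Time
}

func newFSMState() *fsmState {
	return &fsmState{lastFailover: map[string]time.Time{}}
}

func (s *fsmState) applyFailover(cmd failoverCommand) {
	// Idempotency / staleness guard: ignore entries at or before the
	// last failover already applied for this cluster, e.g. during log replay.
	if last, ok := s.lastFailover[cmd.Cluster]; ok && !cmd.IssuedAt.After(last) {
		return
	}
	s.lastFailover[cmd.Cluster] = cmd.IssuedAt
	// ... act on the failover (or merely record it, if this node is not the leader)
}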

SLIDE 26
Orchestrator/raft roadmap

Kubernetes ClusterIP-based configuration in progress. Already container-friendly via auto-reprovisioning of nodes via raft.

SLIDE 27

Thank you! Questions?

github.com/shlomi-noach @ShlomiNoach