Orchestrator on Raft: internals, benefits and considerations - PowerPoint PPT Presentation



SLIDE 1

Orchestrator on Raft: internals, benefits and considerations

Shlomi Noach GitHub PerconaLive 2018

SLIDE 2

About me

@github/database-infrastructure Author of orchestrator, gh-ost, freno, ccql and others. Blog at http://openark.org @ShlomiNoach

SLIDE 3

Agenda

Raft overview
Why orchestrator/raft
orchestrator/raft implementation and nuances
HA, fencing
Service discovery
Considerations

SLIDE 4

Raft

Consensus algorithm Quorum based In-order replication log Delivery, lag Snapshots
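To make the "quorum based" point concrete, here is a minimal Go sketch (not part of the deck) of raft's majority arithmetic: an n-node cluster needs a strict majority of floor(n/2)+1 to elect a leader or commit a log entry, and therefore tolerates floor((n-1)/2) node failures.

package main

import "fmt"

// quorum returns the strict majority required to elect a leader
// or commit a log entry in an n-node raft cluster.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many nodes may fail while a quorum survives.
func tolerated(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("%d nodes: quorum=%d, tolerated failures=%d\n", n, quorum(n), tolerated(n))
	}
	// Output:
	// 3 nodes: quorum=2, tolerated failures=1
	// 5 nodes: quorum=3, tolerated failures=2
}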

SLIDE 5

HashiCorp raft

golang raft implementation Used by Consul Recently hit 1.0.0 github.com/hashicorp/raft
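For orientation, a minimal sketch of assembling a single node with the hashicorp/raft library. The node ID, bind address and in-memory stores are illustrative placeholders, and this is not orchestrator's actual wiring: orchestrator persists the raft log in its backend database, and cluster bootstrapping/joining is omitted here.

package main

import (
	"io"
	"os"
	"time"

	"github.com/hashicorp/raft"
)

// noopFSM is a placeholder finite state machine; a real application
// applies each committed log entry to its own state in Apply().
type noopFSM struct{}

func (noopFSM) Apply(l *raft.Log) interface{}       { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (noopFSM) Restore(rc io.ReadCloser) error      { return rc.Close() }

func main() {
	config := raft.DefaultConfig()
	config.LocalID = raft.ServerID("node1") // illustrative node ID

	// In-memory log/stable/snapshot stores keep the example self-contained.
	logStore := raft.NewInmemStore()
	stableStore := raft.NewInmemStore()
	snapshots := raft.NewInmemSnapshotStore()

	transport, err := raft.NewTCPTransport("127.0.0.1:10008", nil, 3, 10*time.Second, os.Stderr)
	if err != nil {
		panic(err)
	}

	r, err := raft.NewRaft(config, noopFSM{}, logStore, stableStore, snapshots, transport)
	if err != nil {
		panic(err)
	}
	_ = r // r.Apply(), r.State(), r.Leader() etc. drive the node from here
}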

SLIDE 6
Orchestrator

MySQL high availability solution and replication topology manager Developed at GitHub Apache 2 license github.com/github/orchestrator

SLIDE 7

Why orchestrator/raft

Remove MySQL backend dependency. DC fencing. And then good things happened that were not planned: better cross-DC deployments, DC-local KV control, Kubernetes friendly.

SLIDE 8
Orchestrator/raft

n orchestrator nodes form a raft cluster. Each node has its own, dedicated backend database (MySQL or SQLite). All nodes probe the topologies. All nodes run failure detection. Only the leader runs failure recoveries.
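The "everyone detects, only the leader acts" split can be sketched roughly like this in Go. detectFailures and recoverFailure are hypothetical stand-ins for orchestrator's detection and recovery logic, but the leadership check against raft.Leader is the real hashicorp/raft API.

package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// failure is a hypothetical record of a detected topology problem.
type failure struct{ cluster string }

// detectFailures and recoverFailure stand in for orchestrator's real
// probing/detection and recovery code.
func detectFailures() []failure { return nil }
func recoverFailure(f failure)  {}

// detectionLoop runs on every node; recovery is gated on raft leadership.
func detectionLoop(r *raft.Raft) {
	for range time.Tick(time.Second) {
		failures := detectFailures() // every node probes and detects

		if r.State() != raft.Leader {
			continue // followers detect, but do not act
		}
		for _, f := range failures {
			recoverFailure(f) // e.g. promote a replica to be the new master
		}
	}
}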

SLIDE 9

Implementation & deployment @ GitHub

5 nodes (2xDC1, 2xDC2, 1xDC3). 1 second raft polling interval. step-down, raft-yield. SQLite-backed log store. MySQL backend (SQLite backend use case in the works).

SLIDE 10

A high availability scenario

o2 is the leader of a 3-node orchestrator/raft setup (nodes o1, o2, o3).

SLIDE 11

Injecting failure

master: killall -9 mysqld

o2 detects failure. About to recover, but…
SLIDE 12

Injecting 2nd failure

o2: DROP DATABASE orchestrator;
o2 freaks out. 5 seconds later it steps down.

SLIDE 13
Orchestrator recovery

o1 grabs leadership
SLIDE 14

MySQL recovery

o1 detected the failure even before stepping up as leader.
o1, now leader, kicks recovery, fails over the MySQL master.

SLIDE 15
Orchestrator self-health tests

Meanwhile, o2 panics and bails out.

SLIDE 16

puppet

Some time later, puppet kicks the orchestrator service back up on o2.
SLIDE 17
Orchestrator startup

The orchestrator service on o2 bootstraps, creates the orchestrator schema and tables.

SLIDE 18

Joining raft cluster

o2 recovers from raft snapshot, acquires raft log from an active node, rejoins the group.
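How a rejoining node rebuilds state is defined by the FSM it plugs into hashicorp/raft: Restore() loads the latest snapshot, and Apply() then replays the remaining committed log entries, in order. Below is a deliberately tiny, illustrative FSM whose whole state is a counter; orchestrator's real FSM instead applies writes to its backend database.

package example

import (
	"encoding/binary"
	"io"
	"sync/atomic"

	"github.com/hashicorp/raft"
)

type counterFSM struct{ value uint64 }

// Apply is called once per committed log entry, in log order.
func (f *counterFSM) Apply(l *raft.Log) interface{} {
	return atomic.AddUint64(&f.value, 1)
}

// Snapshot captures current state so nodes that are far behind
// (or freshly reprovisioned) don't need the full log.
func (f *counterFSM) Snapshot() (raft.FSMSnapshot, error) {
	return &counterSnapshot{value: atomic.LoadUint64(&f.value)}, nil
}

// Restore replaces the FSM state from a snapshot stream.
func (f *counterFSM) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	var buf [8]byte
	if _, err := io.ReadFull(rc, buf[:]); err != nil {
		return err
	}
	atomic.StoreUint64(&f.value, binary.BigEndian.Uint64(buf[:]))
	return nil
}

type counterSnapshot struct{ value uint64 }

// Persist writes the snapshot to the sink provided by the raft library.
func (s *counterSnapshot) Persist(sink raft.SnapshotSink) error {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], s.value)
	if _, err := sink.Write(buf[:]); err != nil {
		sink.Cancel()
		return err
	}
	return sink.Close()
}

func (s *counterSnapshot) Release() {}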

SLIDE 19

Grabbing leadership

Some time later, o2 grabs leadership

SLIDE 20

DC fencing

Assume this 3-DC setup: one orchestrator node in each DC; master and a few replicas in DC2. What happens if DC2 gets network partitioned, i.e. no network in or out of DC2?

SLIDE 21

DC fencing

From the point of view of DC2 servers, and in particular from the point of view of DC2's orchestrator node: the master and replicas are fine, and DC1 and DC3 servers are all dead, so there is no need for failover. However, DC2's orchestrator is not part of a quorum, hence not the leader. It doesn't call the shots.

SLIDE 22

DC fencing

In the eyes of either DC1's or DC3's orchestrator: all DC2 servers, including the master, are dead. There is need for failover. DC1's and DC3's orchestrator nodes form a quorum. One of them will become the leader. The leader will initiate failover.

SLIDE 23

DC fencing

Depicted is a potential failover result: the new master is from DC3.

SLIDE 24
Orchestrator/raft & Consul

orchestrator is Consul-aware

Upon failover, orchestrator updates Consul KV with the identity of the promoted master

Consul @ GitHub is DC-local, no replication between Consul setups

orchestrator nodes update Consul locally in each DC
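A rough sketch of that Consul update using the official Go client (github.com/hashicorp/consul/api). The function name and the mysql/master/<cluster> key layout are illustrative assumptions, not necessarily orchestrator's exact keys; KV().Put against the local agent is the real API, and talking only to the local agent is what keeps the update DC-local when Consul is not replicated across DCs.

package example

import (
	"fmt"

	consulapi "github.com/hashicorp/consul/api"
)

// announcePromotedMaster writes the promoted master's address into the
// local Consul agent's KV store (hypothetical helper, illustrative key layout).
func announcePromotedMaster(clusterAlias, masterHost string, masterPort int) error {
	// DefaultConfig talks to the local agent (127.0.0.1:8500), i.e. the
	// update stays within this DC's Consul setup.
	client, err := consulapi.NewClient(consulapi.DefaultConfig())
	if err != nil {
		return err
	}
	kv := &consulapi.KVPair{
		Key:   fmt.Sprintf("mysql/master/%s", clusterAlias),
		Value: []byte(fmt.Sprintf("%s:%d", masterHost, masterPort)),
	}
	_, err = client.KV().Put(kv, nil)
	return err
}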
SLIDE 25

Considerations, watch out for

Eventual consistency is not always your best friend. What happens if, upon replay of the raft log, you hit two failovers for the same cluster? NOW() and otherwise time-based assumptions. Reapplying snapshot/log upon startup.
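One way to defuse the NOW()/replay issues, sketched below with hypothetical types (not orchestrator's code): have each raft command carry the timestamp captured when it was created rather than calling time.Now() at apply time, and make applying a command idempotent, so that replaying the log, or meeting a second failover entry for the same cluster, does not trigger the recovery twice.

package example

import "time"

// failoverCommand is a hypothetical raft log entry payload.
type failoverCommand struct {
	Cluster   string
	NewMaster string
	IssuedAt  time.Time // captured when the command was created, not when it is replayed
}

// fsmState is a hypothetical slice of applied state.
type fsmState struct {
	lastFailover map[string]time.Time
}

func newFSMState() *fsmState {
	return &fsmState{lastFailover: map[string]time.Time{}}
}

func (s *fsmState) applyFailover(cmd failoverCommand) {
	// Idempotency / staleness guard: ignore entries at or before the
	// last failover already applied for this cluster, e.g. during log replay.
	if last, ok := s.lastFailover[cmd.Cluster]; ok && !cmd.IssuedAt.After(last) {
		return
	}
	s.lastFailover[cmd.Cluster] = cmd.IssuedAt
	// ... act on the failover (or merely record it, if this node is not the leader)
}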

SLIDE 26
Orchestrator/raft roadmap

Kubernetes ClusterIP-based configuration in progress. Already container-friendly via auto-reprovisioning of nodes via raft.

SLIDE 27

Thank you! Questions?

github.com/shlomi-noach @ShlomiNoach