Availability, Latency and Cost: Withstanding Regional Outages

SLIDE 1

Availability, Latency and Cost: Withstanding Regional Outages

@aaronblohowiak aaronb@netflix.com

SLIDE 2

What to expect

  • Why?
  • Overview!
  • Algebraic Models
      ○ Availability!
      ○ Latency!
      ○ Cost!
  • Architecture!

SLIDE 3

Why?

SLIDE 4
SLIDE 5

You never let a serious crisis go to waste. And what I mean by that it's an opportunity to do things you think you could not do before.

  • Rahm Emanuel
SLIDE 6
SLIDE 7

Good, not great.

SLIDE 8

1. Instability

Good, not great.

SLIDE 9

1. Instability 2. Infrequency

Good, not great.

SLIDE 10

1. Instability 2. Infrequency 3. GOTO 1.

Good, not great.

SLIDE 11

Source: https://martinfowler.com/bliki/FrequencyReducesDifficulty.html

SLIDE 12

One of my favorite soundbites is: if it hurts, do it more often.

  • Martin Fowler
SLIDE 13
SLIDE 14

1. Alerts

Operational Burden

SLIDE 15

1. Alerts 2. Canaries

Operational Burden

SLIDE 16

1. Alerts 2. Canaries 3. WoW Metrics

Operational Burden

SLIDE 17

From Burden to Advantage

SLIDE 18

In general, freedom and rapid recovery is better than trying to prevent error. We are in a creative business, not a safety-critical business.

  • jobs.netflix.com/culture
SLIDE 19

Overview

SLIDE 20

Problem Description Number of Regions

SLIDE 21

SLIDE 22

SLIDE 23

100% Capacity

SLIDE 24

Problem Description Number of Regions

SLIDE 25

N+1 Architecture

SLIDE 26

100%

1+0 (no spare)

SLIDE 27

100% 100%

1+1

SLIDE 28

100% 100%

1+1 = 200%

SLIDE 29

2+1

50% 50% 50%

SLIDE 30

2+1 = 150%

50% 50% 50%

SLIDE 31

2+1 = 150% ?!?!?!?!?!

50% 50% 50%
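
The N+1 arithmetic on the preceding slides can be sketched in a few lines. This is a minimal illustration, not the speaker's own model: it assumes peak global demand is normalized to 100%, traffic is evenly balanced, and the survivors split a failed region's traffic evenly. The function names are made up for this example.

```python
def n_plus_one(n):
    """Size an N+1 deployment: N regions' worth of capacity plus one spare.

    Peak global demand is normalized to 1.0 (100%). Returns the capacity
    each region needs and the total provisioned capacity.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    regions = n + 1
    per_region = 1.0 / n          # any N survivors must carry 100% together
    total = regions * per_region  # works out to 1 + 1/n
    return per_region, total


def survives_single_failure(n):
    """Check that losing any one region still leaves room for 100% of demand."""
    per_region, _ = n_plus_one(n)
    survivors = n  # n + 1 regions minus the failed one
    # Assumption: traffic is evenly balanced across the surviving regions.
    load_per_survivor = 1.0 / survivors
    return load_per_survivor <= per_region + 1e-12


for n in (1, 2, 4):
    per_region, total = n_plus_one(n)
    print(f"{n}+1: {per_region:.0%} per region, {total:.0%} total")
```

Under these assumptions 1+1 provisions 200% and 2+1 provisions 150%, matching the slides: each added region shrinks the spare that has to sit idle.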

SLIDE 32

2+1 Overview

SLIDE 33

SLIDE 34

SLIDE 35

SLIDE 36

Excess Risk

SLIDE 37

SLIDE 38

SLIDE 39

SLIDE 40

Algebraic Models

SLIDE 41

All models are wrong but some are useful

  • George Box
SLIDE 42

Availability

SLIDE 43

Distribution of Change Number of Regions Balance of Traffic

SLIDE 44

Distribution of Change Number of Regions Balance of Traffic

SLIDE 45

Distribution of Change Number of Regions Balance of Traffic

SLIDE 46

SLIDE 47

Distribution of Change Number of Regions Balance of Traffic

SLIDE 48

Distribution of Change Number of Regions Balance of Traffic

SLIDE 49

SLIDE 50

Distribution of Change Number of Regions Balance of Traffic Empirical Risk

SLIDE 51

Latency

SLIDE 52

Which Latency?

SLIDE 53

Normal vs Failover

SLIDE 54

Latency Availability Cost ???

SLIDE 55

If you’re successful, hourly demand maps to population by longitude.

  • Blohowiak’s Third Law
SLIDE 56

Measuring Latency

SLIDE 57

SLIDE 58

SLIDE 59

SLIDE 60

Measuring Latency

SLIDE 61

Measuring Latency

SLIDE 62

Cost

SLIDE 63

2+1

50% 50% 50%

SLIDE 64

SLIDE 65

In N+1 Architecture, minimal failover overhead is 1/N.

SLIDE 66

In N+1 Architecture, minimal failover overhead is 1/N.

Cost = 100% + 1/N

SLIDE 67

In N+1 Architecture, minimal failover overhead is 1/N.

Cost = 100% + 1/N

If costs are pure throughput

SLIDE 68

100%

SLIDE 69

Throughput Portion Database Portion 100%

SLIDE 70

2+1

SLIDE 71

2+1 All data everywhere

SLIDE 72

2+1 All data everywhere >150%

SLIDE 73

Database Portion (DBP)
Region Replication Factor (RRF)

SLIDE 74

In RRF=All

T is Throughput Cost
T = (1 - DBP) * (1 + 1/N)

D is DB Cost
D = DBP * (N + 1)

Total = T + D
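
The RRF=All model above translates directly into code. The sketch below only restates the slide's formulas. The function name is made up for this example and the 20% database portion is a hypothetical input, not a figure from the talk.

```python
def cost_rrf_all(n, dbp):
    """Relative cost of an N+1 deployment where every region stores all data.

    n   -- regions needed to carry peak traffic (n + 1 regions are deployed)
    dbp -- database portion of the single-region baseline cost, from 0 to 1
    Returns (T, D, Total), each as a multiple of the baseline cost.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= dbp <= 1.0:
        raise ValueError("dbp must be between 0 and 1")
    t = (1.0 - dbp) * (1.0 + 1.0 / n)  # throughput share carries the 1/N spare
    d = dbp * (n + 1)                  # one full copy of the data per region
    return t, d, t + d


# Hypothetical input: 20% of baseline spend is database, 2+1 regions.
t, d, total = cost_rrf_all(n=2, dbp=0.2)
print(f"T={t:.2f} D={d:.2f} Total={total:.2f}")
```

With these inputs the total comes to 1.80 times baseline, which is why "all data everywhere" lands above the 150% that a pure-throughput model predicts for 2+1.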

SLIDE 75

SLIDE 76

In RRF=2

T is Throughput Cost
T = (1 - DBP) * (1 + 1/N)

D is DB Cost
D = DBP * 2

Total = T + D
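
To see what a fixed replication factor changes, the model can take RRF as a parameter and be evaluated for two deployments. This is a sketch under stated assumptions: the function name and the 20% database portion are hypothetical, and the model ignores the extra complexity that sharded data brings.

```python
def cost_with_rrf(n, dbp, rrf):
    """Relative cost of an N+1 deployment with a given region replication factor.

    n   -- regions needed to carry peak traffic (n + 1 regions are deployed)
    dbp -- database portion of the single-region baseline cost, from 0 to 1
    rrf -- number of regions that hold a copy of each piece of data
    """
    regions = n + 1
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= dbp <= 1.0:
        raise ValueError("dbp must be between 0 and 1")
    if not 1 <= rrf <= regions:
        raise ValueError("rrf must be between 1 and the number of regions")
    t = (1.0 - dbp) * (1.0 + 1.0 / n)  # throughput cost, as in the slide
    d = dbp * rrf                      # database cost scales with copies kept
    return t + d


DBP = 0.2  # hypothetical database share of baseline cost

three_region_all = cost_with_rrf(n=2, dbp=DBP, rrf=3)  # 2+1, RRF=All
five_region_two = cost_with_rrf(n=4, dbp=DBP, rrf=2)   # 4+1, RRF=2
print(f"3-region RRF=All: {three_region_all:.2f}x baseline")
print(f"5-region RRF=2:   {five_region_two:.2f}x baseline")
```

For this hypothetical input the totals are 1.80x and 1.40x. The saving grows with the database portion, since RRF only affects the D term.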

SLIDE 77

SLIDE 78

SLIDE 79

Cost Summary

  • 50% throughput overhead plus tripled database cost for 3-region RRF=all.
  • 25% throughput overhead plus doubled database cost for 5-region RRF=2, plus a lot of complexity.

SLIDE 80

Architecture

SLIDE 81

Multi-Site Fault Isolation

  • No cross-region Requests!
  • Stateless or Async* Replication!
      ○ Cache Replication!
  • Change One Region at a Time!

SLIDE 82

To shard or not to shard? That is the question.

SLIDE 83

To shard or not to shard? That is the question.

  • Steering

SLIDE 84

To shard or not to shard? That is the question.

  • Steering
  • Rebalancing & Rehoming

SLIDE 85

To shard or not to shard? That is the question.

  • Steering
  • Rebalancing & Rehoming
  • Cost

SLIDE 86

To shard or not to shard? That is the question.

  • Steering
  • Rebalancing & Rehoming
  • Cost
  • Satellites

SLIDE 87

To shard or not to shard? That is the question.

  • Steering
  • Rebalancing & Rehoming
  • Cost
  • Satellites
  • Graph vs Multi-tenant

SLIDE 88

How to RRF=2 with 1/N overhead?

  • Central Savior
  • Ring
  • Custom Hashing

SLIDE 89

Central Savior

SLIDE 90

Central Savior

SLIDE 91

Ring Regions

SLIDE 92

Ring Regions

SLIDE 93

Ring Regions

SLIDE 94

One More Thing

SLIDE 95

What percentage of your outages come from regional failures?

SLIDE 96

Many of the availability benefits come from isolation, not regions.

SLIDE 97

What percentage of your outages come from database failures?

SLIDE 98

Maybe for you and your org having logical stacks makes the most sense.

SLIDE 99

Closing Thoughts

SLIDE 100

Questions?

SLIDE 101

Choose Your Own Adventure

SLIDE 102

What do you want more details on?

  • Steering
  • Scaling
  • Demand Mapping
