Netflix Built Its Own Monitoring System (And You Probably Shouldn’t)



slide-1
SLIDE 1

Netflix Built Its Own Monitoring System (And You Probably Shouldn’t)

Roy Rapoport rsr@netflix.com @royrapoport 6 March 2015

slide-2
SLIDE 2

Not So Much About Telemetry

  • I ♥ telemetry
  • Architecture track Open Space, 11:30AM, Fleming 3rd Floor

slide-3
SLIDE 3

The Knights Who Say NIH

slide-4
SLIDE 4

Agenda

  • Introductions
  • On Judgment
  • Your Problem
  • Your (no, really) Solution
  • Mitigation and Anecdotes
  • (Not) building your own monitoring system

slide-5
SLIDE 5

Introductions: Me

  • About 23 years in technology
  • Systems engineering, networking, software development, QA, release management

  • Time at Netflix: 2076 days (5y:8m:7d)
  • At Netflix:
    • Systems Engineering, Service Delivery in IT
    • Troubleshooter and Builder of Python Things in Product Engineering
    • Now: Engineering Manager, Insight Engineering
slide-6
SLIDE 6

Introductions: Netflix

  • Optimize speed of innovation
  • Constrain availability
  • Cost is what it is
  • Hire smart people, get out of their way

  • Anti-process bias

“Freedom and Responsibility”

slide-7
SLIDE 7

Judgment

slide-8
SLIDE 8

You Have a Problem

(Your job would likely be boring otherwise)

  • Are you the first
    • To have it?
    • To care?
  • Are you sure?

One that looks nice. And not too expensive.

slide-9
SLIDE 9

You Have a Problem

(Your job would likely be boring otherwise)

  • You’re not the first, or only
  • Good news!
  • Then what?
slide-10
SLIDE 10

Adventures in IT-Land

  • (import disclaimer)
  • Not developers
  • Cautious about ongoing support load

  • Not well-trusted
slide-11
SLIDE 11

Adventures in IT-Land

slide-12
SLIDE 12

A Little Bit of …

  • Time, courage, knowledge, pride
  • Cynicism, hubris, fear
slide-13
SLIDE 13
slide-14
SLIDE 14

Technical Reasons for Rejection

(Or: It’s Not You, It’s … Actually, It’s You)

  • Financial Cost
  • Technical incompatibility
slide-15
SLIDE 15

Overqualified!

slide-16
SLIDE 16
  • https://www.flickr.com/photos/54945394@N00
slide-17
SLIDE 17

A Moment for Pedantry

Or: Requirements for “Not Invented Here”

slide-18
SLIDE 18

The Knights Who Say IbPWAU

slide-19
SLIDE 19

A Question of Trust

  • Technical: I don’t trust your product
  • Organizational: I don’t trust you
slide-20
SLIDE 20

I Don’t Trust You

To Care About Me as a Customer

  • You’re selling me something
  • I’m not your only customer
  • I’m not an important customer
  • You don’t care about your customers

slide-21
SLIDE 21

I Don’t Trust You

To build a good product

  • Past performance …
  • “Good for me”
  • Because you said so, that’s why!
slide-22
SLIDE 22

I Don’t Trust You

To build it fast enough

  • Unpredictable velocity
  • When best-case is too slow
  • Or maybe ever (OSS)
slide-23
SLIDE 23

What Now?

slide-24
SLIDE 24

Eventual Consistency

  • Fork n’ merge
  • THE model for OSS
  • Works better for incremental changes

  • Requires alignment of goals
slide-25
SLIDE 25

Eventual Consistency

No Fork Required

  • Start With a New Idea
  • Eventually merge concepts
slide-26
SLIDE 26

Eventual Consistency Example

Mainline Cloud Orchestration

2011

slide-27
SLIDE 27

Eventual Consistency Example

Mainline Cloud Orchestration

2011 2013

slide-28
SLIDE 28

Eventual Consistency Example

Mainline Cloud Orchestration

2011 2013

Insight Engineering CD Automation

slide-29
SLIDE 29

Eventual Consistency Example

Mainline Cloud Orchestration

2011 2013

Insight Engineering CD Automation

2014

Mainline CD Automation

slide-30
SLIDE 30

Eventual Consistency Example

Mainline Cloud Orchestration

2011 2013

Insight Engineering CD Automation

2014

Mainline CD Automation

2015

slide-31
SLIDE 31

Eventual Consistency Example

Mainline Cloud Orchestration

2011 2013 2014

Mainline CD Automation

2015

Insight Engineering CD Automation

slide-32
SLIDE 32

Composability

  • Want this anyway
  • Map scope to options’ scopes
slide-33
SLIDE 33

Composability: Example

Netflix’s Atlas Telemetry Platform

Global Query Endpoint

slide-34
SLIDE 34

Composability: Example

Netflix’s Atlas Telemetry Platform

  • Global Query Endpoint
  • Regional Query Endpoint (x4)
  • Regional Boundary

slide-35
SLIDE 35

Composability: Example

Netflix’s Atlas Telemetry Platform

  • Global Query Endpoint
  • Regional Query Endpoint (x4)
  • Memory
  • Epic
  • Cloudwatch

slide-36
SLIDE 36

Composability: Example

Netflix’s Atlas Telemetry Platform

  • Global Query Endpoint
  • Regional Query Endpoint (x4)
  • Memory
  • Cloudwatch

slide-37
SLIDE 37

Composability: Example

Netflix’s Atlas Telemetry Platform

  • Global Query Endpoint
  • Regional Query Endpoint (x4)
  • Memory
  • Cloudwatch
  • OpenTSDB
  • InfluxDB
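
The layering in this diagram (a global endpoint over regional endpoints over interchangeable backends) can be illustrated with a small composite-pattern sketch. This is not Atlas's actual API: the `MetricSource` interface, the `InMemoryBackend` and `FanOutEndpoint` classes, the region names, and the sample data are all hypothetical, chosen only to show how one query interface lets scopes nest.

```python
from typing import Dict, List, Protocol


class MetricSource(Protocol):
    """Anything that can answer a metric query (hypothetical interface)."""

    def query(self, metric: str) -> List[float]: ...


class InMemoryBackend:
    """Stand-in backend; a real one would wrap an actual time-series store."""

    def __init__(self, data: Dict[str, List[float]]) -> None:
        self._data = data

    def query(self, metric: str) -> List[float]:
        # Unknown metrics yield an empty result rather than an error.
        return list(self._data.get(metric, []))


class FanOutEndpoint:
    """Queries every child source and concatenates the results in order."""

    def __init__(self, children: List[MetricSource]) -> None:
        self._children = children

    def query(self, metric: str) -> List[float]:
        results: List[float] = []
        for child in self._children:
            results.extend(child.query(metric))
        return results


# Two hypothetical regions; the first is composed of two backends.
us_east = FanOutEndpoint([
    InMemoryBackend({"requests": [1.0, 2.0]}),
    InMemoryBackend({"requests": [3.0]}),
])
eu_west = FanOutEndpoint([InMemoryBackend({"requests": [4.0]})])

# The global endpoint is just another fan-out, this time over regions.
global_endpoint = FanOutEndpoint([us_east, eu_west])
print(global_endpoint.query("requests"))  # [1.0, 2.0, 3.0, 4.0]
```

Because every layer exposes the same `query` method, adding or dropping a backend changes only the list handed to one `FanOutEndpoint`; callers of the global endpoint are unaffected.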

slide-38
SLIDE 38

Composability: Example

Deployments and Automated Canary Analysis at Netflix

  • Edge Systems Deployment Automation Platform
  • Edge Systems Canary Analysis
  • API
  • API
  • Mainline Deployment Automation Platform

slide-39
SLIDE 39

Composability: Example

Deployments and Automated Canary Analysis at Netflix

  • Edge Systems Deployment Automation Platform
  • Edge Systems Canary Analysis
  • API
  • Email
  • Insight Engineering Canary Analysis
  • Mainline Deployment Automation Platform

slide-40
SLIDE 40

Composability: Example

Deployments and Automated Canary Analysis at Netflix

  • Edge Systems Deployment Automation Platform
  • Edge Systems Canary Analysis
  • API
  • Insight Engineering Canary Analysis
  • Mainline Deployment Automation Platform

slide-41
SLIDE 41

Composability: Example

Deployments and Automated Canary Analysis at Netflix

  • Edge Systems Deployment Automation Platform
  • Edge Systems Canary Analysis
  • Insight Engineering Canary Analysis
  • Mainline Deployment Automation Platform

slide-42
SLIDE 42

Composability: Example

Deployments and Automated Canary Analysis at Netflix

  • Edge Systems Deployment Automation Platform
  • Insight Engineering Canary Analysis
  • Mainline Deployment Automation Platform

slide-43
SLIDE 43

Composability: Example

Deployments and Automated Canary Analysis at Netflix

  • Edge Systems Deployment Automation Platform
  • Insight Engineering Canary Analysis
  • Mainline Deployment Automation Platform

slide-44
SLIDE 44

Composability: Example

Deployments and Automated Canary Analysis at Netflix

  • Insight Engineering Canary Analysis
  • Mainline Deployment Automation Platform
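
One way to read this sequence of slides: once canary analysis sits behind an API, the deployment platform no longer cares which team's implementation is on the other side. The sketch below illustrates that idea only. The `CanaryAnalyzer` signature, the 10% tolerance, the 0.75 threshold, and the metric names are assumptions made up for the example, not Netflix's actual scoring or interfaces.

```python
from typing import Callable, Dict

# Hypothetical contract: (baseline metrics, canary metrics) -> score in [0, 1].
CanaryAnalyzer = Callable[[Dict[str, float], Dict[str, float]], float]


def tolerance_analyzer(baseline: Dict[str, float], canary: Dict[str, float]) -> float:
    """Illustrative scoring: fraction of metrics within 10% of baseline."""
    if not baseline:
        return 0.0
    within = 0
    for name, base in baseline.items():
        # A metric missing from the canary counts as out of tolerance.
        observed = canary.get(name, float("inf"))
        if abs(observed - base) <= 0.1 * abs(base):
            within += 1
    return within / len(baseline)


class DeploymentPlatform:
    """Depends only on the analyzer contract, not on who implements it."""

    def __init__(self, analyzer: CanaryAnalyzer, threshold: float = 0.75) -> None:
        self._analyzer = analyzer
        self._threshold = threshold

    def should_promote(self, baseline: Dict[str, float], canary: Dict[str, float]) -> bool:
        return self._analyzer(baseline, canary) >= self._threshold


baseline = {"latency_ms": 100.0, "error_rate": 1.0}
healthy = {"latency_ms": 105.0, "error_rate": 1.05}
slow = {"latency_ms": 150.0, "error_rate": 1.0}

platform = DeploymentPlatform(tolerance_analyzer)
print(platform.should_promote(baseline, healthy))  # True
print(platform.should_promote(baseline, slow))     # False

# Swapping in a different analyzer requires no change to the platform.
strict_platform = DeploymentPlatform(lambda b, c: 0.0)
print(strict_platform.should_promote(baseline, healthy))  # False
```

The design choice on display is the narrow contract: a second deployment platform can adopt the same analyzer, or the analyzer can be replaced, without either side being rewritten.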

slide-45
SLIDE 45

One More Reason

“Think of the glory. Think of your reputation. Think how great it'll look on your next resume.”

  • Lois McMaster Bujold
slide-46
SLIDE 46

Judgment

slide-47
SLIDE 47

The Grand Example

Netflix’s Monitoring Platform

  • Prior system owned by IT
slide-48
SLIDE 48

The Grand Example

Netflix’s Monitoring Platform

  • Prior system owned by IT
  • No great OSS products
slide-49
SLIDE 49

The Grand Example

Netflix’s Monitoring Platform

  • Prior system owned by IT
  • No great OSS products
  • Ridiculous scale
slide-50
SLIDE 50

The Grand Example

Netflix’s Monitoring Platform

  • Prior system owned by IT
  • No great OSS products
  • Ridiculous scale
  • Seriously, how hard can it be?
slide-51
SLIDE 51

The Grand Example

Netflix’s Monitoring Platform

  • Took longer than expected
  • Ongoing maintenance
  • UI only recent priority
slide-52
SLIDE 52

The Grand Example

Netflix’s Monitoring Platform

  • Scales efficientlyish
  • Impedance match with dev lifestyle
  • Nicely pluggable*
  • Aggressivish OSS efforts

* Ask me about Real-Time Analytics!

slide-53
SLIDE 53

The Grand Example

Netflix’s Monitoring Platform

  • Still the right solution
  • Worried about Sunk Cost Fallacy
  • Most shouldn’t do this
slide-54
SLIDE 54

Can You Repeat That?

Or: What’s Your Point? Or: I was Tweeting. Did I miss something?

  • What’s important to you?
  • Is this a technical decision? Really?
  • Honest and non-judgmental
  • Any mitigation?
  • Don’t build your own monitoring system. Seriously.
slide-55
SLIDE 55

Name This Group

  • United States
  • Europe
  • China
  • Russia
  • India
  • Japan
  • Blue Origin
  • SpaceX
  • Virgin Galactic
slide-56
SLIDE 56

11:30am Frasier Room (3rd Floor)

@royrapoport rsr@netflix.com