Have You Tried Turning It Off and On Again? David N. Blank-Edelman



SLIDE 1

7/6/18

source: http://leyanda.de/index.php?option=com_content&view=article&id=11

Have You Tried Turning It Off and On Again?

David N. Blank-Edelman Senior Cloud Ops Advocate

SLIDE 2

@otterbook

source: https://medium.com/@Ganticdotco/i-cant-help-but-think-of-the-blue-screen-of-death-f7a47be7ac67

SLIDE 3

SLIDE 4

This is Production.

SLIDE 5

source: https://www.flickr.com/photos/mayhem/4970272960/

This is Production.

SLIDE 6

source: http://leyanda.de/index.php?option=com_content&view=article&id=11

SLIDE 7

Q&A

SLIDE 8

Volunteers?

Rules

SLIDE 9

Level Set: SRE

SLIDE 10

SLIDE 11

Seeking SRE: Conversations About Running Production Systems at Scale

Edited by David N. Blank-Edelman

SLIDE 12

  • Airbnb
  • Amazon
  • Apple
  • Baidu
  • Dropbox
  • Etsy
  • Facebook
  • GitHub
  • LinkedIn
  • Microsoft
  • Netflix
  • Pinterest
  • Spotify
  • Stack Exchange
  • Twitter
  • Uber
  • Yahoo!
  • Yelp

SLIDE 13

SLIDE 14

What Makes SRE, SRE (dramatic recreation)

  • hire only coders
  • have an SLA for your service
  • measure and report performance against SLA
  • Use Error Budgets and gate launches on them
  • Common staffing pool for SRE and DEV
  • Excess Ops work overflows to DEV team
  • Cap SRE operational load at 50%
  • Share 5% of ops work with DEV team
  • Oncall teams at least 8 people, or 6x2
  • Maximum of 2 events per oncall shift
  • Post mortem for every event
  • Post mortems are blameless and focus on process and technology, not people
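
The bullets on measuring against an SLA and gating launches on an error budget can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not something shown in the talk: the class name `ErrorBudget`, its fields, and the 99.9% target are all hypothetical, and a real service would choose its own objective and measurement window.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one service over one measurement window."""

    slo_target: float     # assumed objective, e.g. 0.999 = 99.9% of requests succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that missed the objective

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the objective permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Negative when the service has overspent its budget.
        return self.allowed_failures - self.failed_requests

    def launch_allowed(self) -> bool:
        # Gate: launches proceed only while some budget remains.
        return self.remaining > 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"remaining budget: {budget.remaining:.0f} failed requests")
    print("launch allowed" if budget.launch_allowed() else "launches frozen")
```

The design choice worth noticing is that the gate is a plain comparison on measured data, so the decision to launch or freeze follows from the numbers rather than from negotiation between teams.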

SLIDE 15

SLO

SLO monitor

SLIDE 16

SLO monitor decide

Observation #1:

Create virtuous and reinforcing feedback loops

SLIDE 17

What Makes SRE, SRE (dramatic recreation)

  • hire only coders
  • have an SLA for your service
  • measure and report performance against SLA
  • Use Error Budgets and gate launches on them
  • Common staffing pool for SRE and DEV
  • Excess Ops work overflows to DEV team
  • Cap SRE operational load at 50%
  • Share 5% of ops work with DEV team
  • Oncall teams at least 8 people, or 6x2
  • Maximum of 2 events per oncall shift
  • Post mortem for every event
  • Post mortems are blameless and focus on process and technology, not people

Observation #2:

You can’t fire your way to reliable.

SLIDE 18

Observation #2:

You can’t fire your way to resilient.

The Actual Talk

SLIDE 19

Q: What are the characteristics of an operations practice that actively influence a system towards greater resiliency?

Q: What are some of the characteristics of an operations practice that actively influence a system towards greater resiliency?

SLIDE 20

The Nature of the Work

SLIDE 21

Interfaces

SLIDE 22

Data

SLIDE 23

Errors

SLIDE 24

Ambiguity

SLIDE 25

“...I would like to beg you, dear Sir, as well as I can, to have patience with everything unresolved in your heart and to try to love the questions themselves as if they were locked rooms or books written in a very foreign language. Don't search for the answers, which could not be given to you now, because you would not be able to live them. And the point is, to live everything. Live the questions now. Perhaps then, someday far in the future, you will gradually, without even noticing it, live your way into the answer.”

—Rainer Maria Rilke, Letters to a Young Poet (#4)

SLIDE 26

Q: What are some more of the characteristics of an operations practice that actively influence a system towards greater resiliency?


(More) Characteristics of an Operations Practice

SLIDE 27

Check In


David N. Blank-Edelman

Senior Cloud Ops Advocate

@otterbook dnb@microsoft.com /in/dnblankedelman