5 Years of Metrics & Monitoring - Lindsay Holmwood (@auxesis) - PowerPoint PPT Presentation



SLIDE 1

5 Years of Metrics & Monitoring

Lindsay Holmwood @auxesis

SLIDE 2

Cultural & Technical

SLIDE 3

Key retrospective questions:
  • What did we do well?
  • What did we learn?
  • What should we do differently next time?
  • What still puzzles us?

SLIDE 4

What got us here

won’t get us there

SLIDE 5

What did we do well?

(that if we don’t talk about, we might forget)

SLIDE 6

The Pipeline

SLIDE 7

collection · aggregation · storage · graphing · checking · alerting

SLIDE 8 (pipeline diagram)

collectd & statsd
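StatsD's wire format is simple enough to emit without a client library: one UDP datagram per event, `name:value|type`. A minimal sketch (the metric name is illustrative; 8125 is statsd's conventional default port):

```python
import socket

def statsd_packet(metric, value=1, kind="c"):
    """Build a StatsD line-protocol payload, e.g. b'app.logins:1|c'."""
    return f"{metric}:{value}|{kind}".encode()

def statsd_send(metric, value=1, kind="c", host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP means no connection setup and no blocking the app."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_packet(metric, value, kind), (host, port))
```

The `kind` suffix selects the metric type: `c` for counters, `ms` for timers, `g` for gauges.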

SLIDE 9 (pipeline diagram)

Graphite & OpenTSDB & InfluxDB
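Graphite's plaintext protocol is similarly minimal: one `dotted.path value timestamp` line per sample, sent to carbon's listener (conventionally TCP port 2003). A sketch with an illustrative metric path:

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """One Graphite plaintext-protocol line: 'dotted.path value unix_ts\\n'."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"

def graphite_send(lines, host="127.0.0.1", port=2003):
    """Ship a batch of lines to carbon's plaintext listener over TCP."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall("".join(lines).encode())
```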

SLIDE 10 (pipeline diagram)

Riemann

SLIDE 11

Alert fatigue has become a recognised problem

SLIDE 12

Cottage industry

SLIDE 13

PagerDuty & VictorOps & OpsGenie

SLIDE 14
  • Librato
  • Datadog
  • Metafor
  • New Relic
  • Pingdom
  • Dataloop.io
  • Big Panda
  • AppDynamics
  • Stackdriver
  • PagerDuty
  • VictorOps
  • OpsGenie

SLIDE 15

#monitoringsucks

SLIDE 16

https://github.com/monitoringsucks/tool-repos

SLIDE 17

https://github.com/monitoringsucks/metrics-catalog

SLIDE 18

If your business had to choose one metric to alert off, what would it be?

SLIDE 19

#monitoringlove

SLIDE 20

SLIDE 21

What would we do differently next time?

SLIDE 22

Graphs & Dashboards

SLIDE 23

Apparently the hardest problem in monitoring is graphing and dashboarding.

SLIDE 24

What we’re doing wrong

SLIDE 25

Strip charts

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

We have a problem

SLIDE 30

Strip charts: the PHP hammer of graphing

SLIDE 31

What can the data tell us?

SLIDE 32

What is the distribution?

SLIDE 33

It’s not a problem with the tools

SLIDE 34

Our approach is tainted

SLIDE 35 (diagram: “graphing problems we have” vs “graphing problems serviced by strip charts”)
SLIDE 36

SLIDE 37

Basic graph layout

SLIDE 38

Black on white

SLIDE 39

Bounding box with x + y axis labels

(example graph with numbered axis ticks)

SLIDE 40

Colour

SLIDE 41

Differential colour engine

SLIDE 42

SLIDE 43

Maximum of 15 colours on-screen

SLIDE 44

8%

SLIDE 45

Adjust saturation, not hue

SLIDE 46

This is hue

SLIDE 47

This is saturation

SLIDE 48

SLIDE 49

Use minimal hue to call out data
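The hue/saturation advice translates directly to code: hold hue and lightness fixed, step only saturation to render the background series, and spend a contrasting hue on the one series you want to call out. A sketch with the standard library's `colorsys` (the specific hue values are illustrative):

```python
import colorsys

def saturation_ramp(hue, n, lightness=0.5):
    """n RGB colours sharing one hue, varying only saturation (grey -> fully saturated)."""
    steps = [i / (n - 1) for i in range(n)]
    return [colorsys.hls_to_rgb(hue, lightness, s) for s in steps]

# Background series: one blue-ish hue, five saturation steps.
background = saturation_ramp(0.6, 5)

# The single call-out series gets a different hue instead.
callout = colorsys.hls_to_rgb(0.05, 0.5, 1.0)  # a saturated orange-red
```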

SLIDE 50

SLIDE 51

Fucking Pie Charts

SLIDE 52

SLIDE 53

Experiment: compare segment sizes

SLIDE 54

SLIDE 55

SLIDE 56

SLIDE 57

SLIDE 58

“This allows us to see very clearly that the pie chart judgements are less accurate than the bar chart judgements.”
– William S. Cleveland, The Elements of Graphing Data, p. 86

SLIDE 59

Pie chart comparisons are more error prone

SLIDE 60

The only time you should use a pie chart

(pie chart: “Pie eaten” vs “Pie not eaten”)

SLIDE 61

Or maybe this

SLIDE 62
SLIDE 63

What did we learn?

SLIDE 64

Democratisation of graphing tool development

SLIDE 65

Scratch our itches

SLIDE 66

Same poor UX, better paint job

SLIDE 67

SLIDE 68

SLIDE 69

We get the graphing tools we deserve

SLIDE 70

SLIDE 71

Nagios is here to stay

(at least for ops)

SLIDE 72

kartar.net/2014/11/monitoring-survey---tools
SLIDE 73 (survey results: monitoring tools in use)

  Nagios           283
  Other            193
  Icinga            86
  Sensu             77
  Zabbix            68
  New Relic         47
  Homegrown tool    38
  Zenoss            17
  Riemann            9
  Shinken            8
  nil                7
  OpenNMS            6
  Ganglia            6
SLIDE 74

Inertia

SLIDE 75

No strong, compelling alternative

SLIDE 76

Sensu

SLIDE 77

When I hear people say “I'm not using Sensu because it's too complex” I think “and Nagios isn't hiding the same complexity from you?”

SLIDE 78

This is a problem

SLIDE 79

Using Nagios? Look at Icinga & Naemon

SLIDE 80

SLIDE 81

We don’t know stats

SLIDE 82 (pipeline diagram, checks highlighted)

SLIDE 83

Poor statistical literacy has implications for graphs & checks

SLIDE 84

Graphs

SLIDE 85

“We need many partially overlapping and always somehow contradictory descriptive layers to approximate a rendition of reality”
– Niels Bohr

SLIDE 86

SLIDE 87

SLIDE 88

SLIDE 89

SLIDE 90

SLIDE 91

SLIDE 92

SLIDE 93

SLIDE 94

D3 & NVD3

SLIDE 95

SLIDE 96

Checks

SLIDE 97

Numbers & Strings & Behaviour

SLIDE 98

Numbers

SLIDE 99

Fault detection

(thresholding)
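Fault detection with static thresholds is the model Nagios popularised: compare the latest sample against warning and critical bounds and map the result to an exit status. A minimal sketch (the bounds in the comment are illustrative):

```python
# Nagios-style exit statuses.
OK, WARNING, CRITICAL = 0, 1, 2

def check_threshold(value, warn, crit):
    """Classic fault detection: compare one sample against static bounds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

# e.g. disk usage percentage, warn at 80, crit at 90
```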

SLIDE 100

Anomaly detection

(trend analysis)
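A minimal sketch of trend analysis: flag a sample that strays too many standard deviations from a sliding window of recent values (window size and sigma threshold are illustrative, and the technique assumes roughly normal, stationary data):

```python
from statistics import mean, stdev

def is_anomalous(window, sample, sigmas=3.0):
    """Trend analysis at its simplest: flag samples more than
    `sigmas` standard deviations from the recent window's mean."""
    mu = mean(window)
    sd = stdev(window)
    return abs(sample - mu) > sigmas * sd
```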

SLIDE 101

Anscombe’s Quartet

SLIDE 102

       I           II          III          IV
   x     y     x     y     x     y     x     y
  10.0  8.04  10.0  9.14  10.0  7.46   8.0  6.58
   8.0  6.95   8.0  8.14   8.0  6.77   8.0  5.76
  13.0  7.58  13.0  8.74  13.0 12.74   8.0  7.71
   9.0  8.81   9.0  8.77   9.0  7.11   8.0  8.84
  11.0  8.33  11.0  9.26  11.0  7.81   8.0  8.47
  14.0  9.96  14.0  8.10  14.0  8.84   8.0  7.04
   6.0  7.24   6.0  6.13   6.0  6.08   8.0  5.25
   4.0  4.26   4.0  3.10   4.0  5.39  19.0 12.50
  12.0 10.84  12.0  9.13  12.0  8.15   8.0  5.56
   7.0  4.82   7.0  7.26   7.0  6.42   8.0  7.91
   5.0  5.68   5.0  4.74   5.0  5.73   8.0  6.89
SLIDE 103

SLIDE 104

  Mean of x: 9        Mean of y: 7.50
  Variance of x: 11   Variance of y: 4.12
  Correlation between x and y: 0.816
  Linear regression line: y = 0.5x + 3

for all series
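The quartet is easy to verify with the standard library: four visually wild datasets, near-identical summary statistics. A sketch using the data from the slide above:

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(x, y):
    """Pearson correlation coefficient, computed by hand for portability."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Same mean, variance, and correlation for every series.
for name, (x, y) in quartet.items():
    print(name, mean(x), round(mean(y), 2), variance(x), round(pearson(x, y), 3))
```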

SLIDE 105

Abe Stanway’s “Mom! My algorithms suck!”

SLIDE 106

Toufic Boubez’s “Some simple math for anomaly detection”

SLIDE 107

Behaviour

SLIDE 108

  Commercial               Open Source
  StillAlive               Selenium + WebDriver
  New Relic Synthetics     PhantomJS
                           Mechanize

SLIDE 109

Monitoring is CI for Production

SLIDE 110

Continuous Integration:
  1. checkout
  2. build
  3. test
  4. notify

Monitoring: can I see my app?

SLIDE 111

serverspec & sensu
SLIDE 112

SLIDE 113

What still puzzles us?

(or, what might the future look like?)

SLIDE 114

The future is analysing & acting on our alert data

SLIDE 115

Last 5 years:
  • Building new tools
  • Formalising relationships
  • Search for parallels in other industries
  • Measuring the human impact

SLIDE 116

Next:
  • Stabilisation of tools
  • Emerging standards
  • Exploiting parallels
  • Mitigating the human impact
SLIDE 117

Analysis: Ops Weekly

SLIDE 118

SLIDE 119

SLIDE 120

Context: Nagios Herald

SLIDE 121

SLIDE 122

SLIDE 123

The future is richer metadata about our metrics

SLIDE 124

Metrics 2.0

SLIDE 125

  {
    server: dfs1
    what: diskspace
    mountpoint: srv/node/dfs10
    unit: B
    type: used
    metric_type: gauge
  }
  meta: {
    agent: diamond,
    processed_by: statsd2
  }

SLIDE 126

Self-describing

SLIDE 127

The future is richer metadata about our metrics to automatically build appropriate visualisations

SLIDE 128
  • Aggregation &
  • Grouping &
  • Unit conversions &
  • Scaling &
  • Axes labelling &
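A taste of what self-describing metrics buy you: with `what`, `type`, and `unit` in the metadata, a dashboard can pick the scale and write the axis label itself instead of making a human do it. A sketch (the metadata keys follow the metrics 2.0-style example above; the scaling helper is illustrative):

```python
def scale_si(value, unit):
    """Pick a human-readable SI multiple, e.g. 5_000_000_000 B -> '5.0 GB'."""
    for prefix in ("", "K", "M", "G", "T"):
        if abs(value) < 1000:
            return f"{value:.1f} {prefix}{unit}"
        value /= 1000
    return f"{value:.1f} P{unit}"

def axis_label(metric):
    """Derive an axis label from self-describing metric metadata."""
    return f"{metric['what']} ({metric['type']}, {metric['unit']})"

metric = {"server": "dfs1", "what": "diskspace", "unit": "B",
          "type": "used", "metric_type": "gauge"}
```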
SLIDE 129

Death to strip charts

SLIDE 130

The future is monitoring tools for devs

SLIDE 131

Ops must be enablers, not gatekeepers

SLIDE 132

What has made sense about ops being gatekeepers?

SLIDE 133

Monitoring is treated as an operational responsibility

SLIDE 134

Ops team own ops
SLIDE 135

We’ve won the battles

SLIDE 136

This is no longer the world we live in

Ops team own ops
SLIDE 137

How do we become enablers?

SLIDE 138

Technical & Cultural

SLIDE 139

Technical:
  • Ops provide the platform
  • Maintain, monitor, and scale the platform

SLIDE 140

— Adrian Cockcroft

SLIDE 141

Cultural:
  • Coach on what makes a good check
  • Coach on what is good alert design
  • Listen to the needs of the end-user

SLIDE 142

Provide monitoring as a service

SLIDE 143

Monitoring is a core deliverable on every service
SLIDE 144

Ship checks & config with your applications

SLIDE 145

Example: Yelp

SLIDE 146

Thomas Doran’s “Sensu and Sensibility”

SLIDE 147 (diagram: application deploy, monitoring checks + config, sensu-client, sensu-server)
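Shipping checks with the application can be as simple as generating a check definition at deploy time and dropping it where the agent reads its config. A sketch in Sensu 0.x's JSON check format (the check name, command, and paths are illustrative, not Yelp's actual setup):

```python
import json

def render_check(app, port):
    """Render a Sensu-style check definition to deploy alongside the app,
    e.g. into /etc/sensu/conf.d/<app>.json."""
    return json.dumps({
        "checks": {
            f"{app}_health": {
                "command": f"check-http.rb -u http://localhost:{port}/health",
                "interval": 60,
                "subscribers": [app],
            }
        }
    }, indent=2, sort_keys=True)
```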
SLIDE 148

github.com/solarkennedy/sensu-report

SLIDE 149

What’s the barrier to entry?

SLIDE 150

Does the idea just not have traction?

SLIDE 151

Are the tools not up to scratch?

SLIDE 152

Does monitoring need to be SaaS (or SaaS-like) to make this achievable at scale?

SLIDE 153

SaaS as an accelerator

SLIDE 154
  • Librato
  • Datadog
  • Metafor
  • New Relic
  • Pingdom
  • Dataloop.io
  • Big Panda
  • AppDynamics
  • Stackdriver
  • PagerDuty
  • VictorOps
  • Rad Alert

SLIDE 155

“The future is here – it’s just not very evenly distributed”
– William Gibson

SLIDE 156

Monitoring is still insular

SLIDE 157

We’re building tools for operations teams

SLIDE 158

Not the developers who need them most

SLIDE 159

Monitoring is like a joke. If you have to explain it, it’s not that good.

SLIDE 160 (pipeline diagram)

SLIDE 161

What can we do better?

SLIDE 162

I’m Lindsay

@auxesis

SLIDE 163

Thank you!

Liked the talk? Let @auxesis know.