5 Years of Metrics & Monitoring - Lindsay Holmwood (@auxesis) - PowerPoint PPT Presentation



SLIDE 1

5 Years of Metrics & Monitoring

Lindsay Holmwood @auxesis

SLIDE 2

Cultural & Technical

SLIDE 3

Key retrospective questions:
  • What did we do well?
  • What did we learn?
  • What should we do differently next time?
  • What still puzzles us?

SLIDE 4

What got us here

won’t get us there

SLIDE 5

What did we do well?

(that if we don’t talk about, we might forget)

SLIDE 6

The Pipeline

SLIDE 7

collection · aggregation · storage · graphing · checking · alerting

SLIDE 8 (pipeline diagram)

collectd & statsd
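StatsD's wire format is simple enough to emit without a client library: one UDP datagram per event, `name:value|type`. A minimal sketch (the metric name is illustrative; 8125 is statsd's conventional default port):

```python
import socket

def statsd_packet(metric, value=1, kind="c"):
    """Build a StatsD line-protocol payload, e.g. b'app.logins:1|c'."""
    return f"{metric}:{value}|{kind}".encode()

def statsd_send(metric, value=1, kind="c", host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP means no connection setup and no blocking the app."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_packet(metric, value, kind), (host, port))
```

The `kind` suffix selects the metric type: `c` for counters, `ms` for timers, `g` for gauges.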

SLIDE 9 (pipeline diagram)

Graphite & OpenTSDB & InfluxDB
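Graphite's plaintext protocol is similarly minimal: one `dotted.path value timestamp` line per sample, sent to carbon's listener (conventionally TCP port 2003). A sketch with an illustrative metric path:

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """One Graphite plaintext-protocol line: 'dotted.path value unix_ts\\n'."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"

def graphite_send(lines, host="127.0.0.1", port=2003):
    """Ship a batch of lines to carbon's plaintext listener over TCP."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall("".join(lines).encode())
```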

SLIDE 10 (pipeline diagram)

Riemann

SLIDE 11

Alert fatigue has become a recognised problem

SLIDE 12

Cottage industry

SLIDE 13

PagerDuty & VictorOps & OpsGenie

SLIDE 14
  • Librato
  • Datadog
  • Metafor
  • New Relic
  • Pingdom
  • Dataloop.io
  • Big Panda
  • AppDynamics
  • Stackdriver
  • PagerDuty
  • VictorOps
  • OpsGenie

SLIDE 15

#monitoringsucks

SLIDE 16

https://github.com/monitoringsucks/tool-repos

SLIDE 17

https://github.com/monitoringsucks/metrics-catalog

SLIDE 18

If your business had to choose one metric to alert off, what would it be?

SLIDE 19

#monitoringlove

SLIDE 20

SLIDE 21

What would we do differently next time?

SLIDE 22

Graphs & Dashboards

SLIDE 23

Apparently the hardest problem in monitoring is graphing and dashboarding.

SLIDE 24

What we’re doing wrong

SLIDE 25

Strip charts

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

We have a problem

SLIDE 30

Strip charts: the PHP hammer of graphing

SLIDE 31

What can the data tell us?

SLIDE 32

What is the distribution?

SLIDE 33

It’s not a problem with the tools

SLIDE 34

Our approach is tainted

SLIDE 35 (diagram: “graphing problems we have” vs “graphing problems serviced by strip charts”)
SLIDE 36

SLIDE 37

Basic graph layout

SLIDE 38

Black on white

SLIDE 39

Bounding box with x + y axis labels

(example graph with numbered axis ticks)

SLIDE 40

Colour

SLIDE 41

Differential colour engine

SLIDE 42

SLIDE 43

Maximum of 15 colours on-screen

SLIDE 44

8%

SLIDE 45

Adjust saturation, not hue

SLIDE 46

This is hue

SLIDE 47

This is saturation

SLIDE 48

SLIDE 49

Use minimal hue to call out data
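The hue/saturation advice translates directly to code: hold hue and lightness fixed, step only saturation to render the background series, and spend a contrasting hue on the one series you want to call out. A sketch with the standard library's `colorsys` (the specific hue values are illustrative):

```python
import colorsys

def saturation_ramp(hue, n, lightness=0.5):
    """n RGB colours sharing one hue, varying only saturation (grey -> fully saturated)."""
    steps = [i / (n - 1) for i in range(n)]
    return [colorsys.hls_to_rgb(hue, lightness, s) for s in steps]

# Background series: one blue-ish hue, five saturation steps.
background = saturation_ramp(0.6, 5)

# The single call-out series gets a different hue instead.
callout = colorsys.hls_to_rgb(0.05, 0.5, 1.0)  # a saturated orange-red
```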

SLIDE 50

SLIDE 51

Fucking Pie Charts

SLIDE 52

SLIDE 53

Experiment: compare segment sizes

SLIDE 54

SLIDE 55

SLIDE 56

SLIDE 57

SLIDE 58

“This allows us to see very clearly that the pie chart judgements are less accurate than the bar chart judgements.”
– William S. Cleveland, The Elements of Graphing Data, p. 86

SLIDE 59

Pie chart comparisons are more error prone

SLIDE 60

The only time you should use a pie chart

(pie chart: “Pie eaten” vs “Pie not eaten”)

SLIDE 61

Or maybe this

SLIDE 62
SLIDE 63

What did we learn?

SLIDE 64

Democratisation of graphing tool development

SLIDE 65

Scratch our itches

SLIDE 66

Same poor UX, better paint job

SLIDE 67

SLIDE 68

SLIDE 69

We get the graphing tools we deserve

SLIDE 70

SLIDE 71

Nagios is here to stay

(at least for ops)

SLIDE 72

kartar.net/2014/11/monitoring-survey---tools
SLIDE 73 (survey results: monitoring tools in use)

  Nagios           283
  Other            193
  Icinga            86
  Sensu             77
  Zabbix            68
  New Relic         47
  Homegrown tool    38
  Zenoss            17
  Riemann            9
  Shinken            8
  nil                7
  OpenNMS            6
  Ganglia            6
SLIDE 74

Inertia

SLIDE 75

No strong, compelling alternative

SLIDE 76

Sensu

SLIDE 77

When I hear people say “I'm not using Sensu because it's too complex” I think “and Nagios isn't hiding the same complexity from you?”

SLIDE 78

This is a problem

SLIDE 79

Using Nagios? Look at Icinga & Naemon

SLIDE 80

SLIDE 81

We don’t know stats

SLIDE 82 (pipeline diagram, checks highlighted)

SLIDE 83

Poor statistical literacy has implications for graphs & checks

SLIDE 84

Graphs

SLIDE 85

“We need many partially overlapping and always somehow contradictory descriptive layers to approximate a rendition of reality”
– Niels Bohr

SLIDE 86

SLIDE 87

SLIDE 88

SLIDE 89

SLIDE 90

SLIDE 91

SLIDE 92

SLIDE 93

SLIDE 94

D3 & NVD3

SLIDE 95

SLIDE 96

Checks

SLIDE 97

Numbers & Strings & Behaviour

SLIDE 98

Numbers

SLIDE 99

Fault detection

(thresholding)
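Fault detection with static thresholds is the model Nagios popularised: compare the latest sample against warning and critical bounds and map the result to an exit status. A minimal sketch (the bounds in the comment are illustrative):

```python
# Nagios-style exit statuses.
OK, WARNING, CRITICAL = 0, 1, 2

def check_threshold(value, warn, crit):
    """Classic fault detection: compare one sample against static bounds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

# e.g. disk usage percentage, warn at 80, crit at 90
```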

SLIDE 100

Anomaly detection

(trend analysis)
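A minimal sketch of trend analysis: flag a sample that strays too many standard deviations from a sliding window of recent values (window size and sigma threshold are illustrative, and the technique assumes roughly normal, stationary data):

```python
from statistics import mean, stdev

def is_anomalous(window, sample, sigmas=3.0):
    """Trend analysis at its simplest: flag samples more than
    `sigmas` standard deviations from the recent window's mean."""
    mu = mean(window)
    sd = stdev(window)
    return abs(sample - mu) > sigmas * sd
```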

SLIDE 101

Anscombe’s Quartet

SLIDE 102

       I           II          III          IV
   x     y     x     y     x     y     x     y
  10.0  8.04  10.0  9.14  10.0  7.46   8.0  6.58
   8.0  6.95   8.0  8.14   8.0  6.77   8.0  5.76
  13.0  7.58  13.0  8.74  13.0 12.74   8.0  7.71
   9.0  8.81   9.0  8.77   9.0  7.11   8.0  8.84
  11.0  8.33  11.0  9.26  11.0  7.81   8.0  8.47
  14.0  9.96  14.0  8.10  14.0  8.84   8.0  7.04
   6.0  7.24   6.0  6.13   6.0  6.08   8.0  5.25
   4.0  4.26   4.0  3.10   4.0  5.39  19.0 12.50
  12.0 10.84  12.0  9.13  12.0  8.15   8.0  5.56
   7.0  4.82   7.0  7.26   7.0  6.42   8.0  7.91
   5.0  5.68   5.0  4.74   5.0  5.73   8.0  6.89
SLIDE 103

SLIDE 104

  Mean of x: 9        Mean of y: 7.50
  Variance of x: 11   Variance of y: 4.12
  Correlation between x and y: 0.816
  Linear regression line: y = 0.5x + 3

for all series
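The quartet is easy to verify with the standard library: four visually wild datasets, near-identical summary statistics. A sketch using the data from the slide above:

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(x, y):
    """Pearson correlation coefficient, computed by hand for portability."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Same mean, variance, and correlation for every series.
for name, (x, y) in quartet.items():
    print(name, mean(x), round(mean(y), 2), variance(x), round(pearson(x, y), 3))
```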

SLIDE 105

Abe Stanway’s “Mom! My algorithms suck!”

SLIDE 106

Toufic Boubez’s “Some simple math for anomaly detection”

SLIDE 107

Behaviour

SLIDE 108

  Commercial               Open Source
  StillAlive               Selenium + WebDriver
  New Relic Synthetics     PhantomJS
                           Mechanize

SLIDE 109

Monitoring is CI for Production

SLIDE 110

Continuous Integration:
  1. checkout
  2. build
  3. test
  4. notify

Monitoring: can I see my app?

SLIDE 111

serverspec & sensu
SLIDE 112

SLIDE 113

What still puzzles us?

(or, what might the future look like?)

SLIDE 114

The future is analysing & acting on our alert data

SLIDE 115

Last 5 years:
  • Building new tools
  • Formalising relationships
  • Search for parallels in other industries
  • Measuring the human impact

SLIDE 116

Next:
  • Stabilisation of tools
  • Emerging standards
  • Exploiting parallels
  • Mitigating the human impact
SLIDE 117

Analysis: Ops Weekly

SLIDE 118

SLIDE 119

SLIDE 120

Context: Nagios Herald

SLIDE 121

SLIDE 122

SLIDE 123

The future is richer metadata about our metrics

SLIDE 124

Metrics 2.0

SLIDE 125

  {
    server: dfs1
    what: diskspace
    mountpoint: srv/node/dfs10
    unit: B
    type: used
    metric_type: gauge
  }
  meta: {
    agent: diamond,
    processed_by: statsd2
  }

SLIDE 126

Self-describing

SLIDE 127

The future is richer metadata about our metrics to automatically build appropriate visualisations

SLIDE 128
  • Aggregation &
  • Grouping &
  • Unit conversions &
  • Scaling &
  • Axes labelling &
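A taste of what self-describing metrics buy you: with `what`, `type`, and `unit` in the metadata, a dashboard can pick the scale and write the axis label itself instead of making a human do it. A sketch (the metadata keys follow the metrics 2.0-style example above; the scaling helper is illustrative):

```python
def scale_si(value, unit):
    """Pick a human-readable SI multiple, e.g. 5_000_000_000 B -> '5.0 GB'."""
    for prefix in ("", "K", "M", "G", "T"):
        if abs(value) < 1000:
            return f"{value:.1f} {prefix}{unit}"
        value /= 1000
    return f"{value:.1f} P{unit}"

def axis_label(metric):
    """Derive an axis label from self-describing metric metadata."""
    return f"{metric['what']} ({metric['type']}, {metric['unit']})"

metric = {"server": "dfs1", "what": "diskspace", "unit": "B",
          "type": "used", "metric_type": "gauge"}
```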
SLIDE 129

Death to strip charts

SLIDE 130

The future is monitoring tools for devs

SLIDE 131

Ops must be enablers, not gatekeepers

SLIDE 132

What has made sense about ops being gatekeepers?

SLIDE 133

Monitoring is treated as an operational responsibility

SLIDE 134

Ops team own ops
SLIDE 135

We’ve won the battles

SLIDE 136

This is no longer the world we live in

Ops team own ops
SLIDE 137

How do we become enablers?

SLIDE 138

Technical & Cultural

SLIDE 139

Technical:
  • Ops provide the platform
  • Maintain, monitor, and scale the platform

SLIDE 140

— Adrian Cockcroft

SLIDE 141

Cultural:
  • Coach on what makes a good check
  • Coach on what is good alert design
  • Listen to the needs of the end-user

SLIDE 142

Provide monitoring as a service

SLIDE 143

Monitoring is a core deliverable on every service
SLIDE 144

Ship checks & config with your applications

SLIDE 145

Example: Yelp

SLIDE 146

Thomas Doran’s “Sensu and Sensibility”

SLIDE 147 (diagram: application deploy, monitoring checks + config, sensu-client, sensu-server)
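Shipping checks with the application can be as simple as generating a check definition at deploy time and dropping it where the agent reads its config. A sketch in Sensu 0.x's JSON check format (the check name, command, and paths are illustrative, not Yelp's actual setup):

```python
import json

def render_check(app, port):
    """Render a Sensu-style check definition to deploy alongside the app,
    e.g. into /etc/sensu/conf.d/<app>.json."""
    return json.dumps({
        "checks": {
            f"{app}_health": {
                "command": f"check-http.rb -u http://localhost:{port}/health",
                "interval": 60,
                "subscribers": [app],
            }
        }
    }, indent=2, sort_keys=True)
```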
SLIDE 148

github.com/solarkennedy/sensu-report

SLIDE 149

What’s the barrier to entry?

SLIDE 150

Does the idea just not have traction?

SLIDE 151

Are the tools not up to scratch?

SLIDE 152

Does monitoring need to be SaaS (or SaaS-like) to make this achievable at scale?

SLIDE 153

SaaS as an accelerator

SLIDE 154
  • Librato
  • Datadog
  • Metafor
  • New Relic
  • Pingdom
  • Dataloop.io
  • Big Panda
  • AppDynamics
  • Stackdriver
  • PagerDuty
  • VictorOps
  • Rad Alert

SLIDE 155

“The future is here – it’s just not very evenly distributed”
– William Gibson

SLIDE 156

Monitoring is still insular

SLIDE 157

We’re building tools for operations teams

SLIDE 158

Not the developers who need them most

SLIDE 159

Monitoring is like a joke. If you have to explain it, it’s not that good.

SLIDE 160 (pipeline diagram)

SLIDE 161

What can we do better?

SLIDE 162

I’m Lindsay

@auxesis

SLIDE 163

Thank you!

Liked the talk? Let @auxesis know.