@snehainguva observability and product release: leveraging prometheus to build and test new products


SLIDE 1

@snehainguva

SLIDE 2

observability and product release:

leveraging prometheus to build and test new products

digitalocean.com

SLIDE 3

about me

software engineer @DigitalOcean
currently network services
<3 cats

SLIDE 4

some stats

SLIDE 5

90M+ timeseries
85 instances of prometheus
1.7M+ samples/sec

SLIDE 6

the history

SLIDE 7

ye’ olden days

use nagios + various plugins to monitor
use collectd + statsd + graphite
OpenTSDB

SLIDE 8

lovely prometheus

white-box monitoring
multi-dimensional data model
fantastic querying language

SLIDE 9

glorious kubernetes

easily deploy and update services
scalability
combine with prometheus + alertmanager

SLIDE 10

sneha joins networking

set up monitoring for VPC
working on DHCP
how can we use prometheus even before release?

SLIDE 11

the plan:

✔ observability DigitalOcean
build --- instrument --- test --- iterate
examples

SLIDE 12

metrics: time-series of sampled data
tracing: propagating metadata through different requests, threads, and processes
logging: record of discrete events over time

SLIDE 13

metrics: what do we measure?

SLIDE 14

four golden signals

SLIDE 15

latency: time to service a request
traffic: requests/second
error: error rate of requests
saturation: fullness of a service
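
The four definitions above can be turned into numbers from a window of request records. The following is a minimal sketch, not the speaker's code: the `request` and `signals` types, the `summarize` function, the choice of a nearest-rank 99th percentile for latency, and the in-flight/capacity ratio for saturation are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// request is a hypothetical record of one served request.
type request struct {
	latency time.Duration
	failed  bool
}

// signals holds the four golden signals for one observation window.
type signals struct {
	p99Latency time.Duration // latency: nearest-rank 99th percentile
	perSecond  float64       // traffic: requests per second
	errorRate  float64       // errors: failed requests / all requests
	saturation float64       // saturation: in-flight requests / capacity
}

// summarize computes the signals for the requests seen during window.
// inFlight and capacity are assumed to be reported by the service itself.
func summarize(reqs []request, window time.Duration, inFlight, capacity int) signals {
	if len(reqs) == 0 || window <= 0 || capacity <= 0 {
		return signals{}
	}
	lat := make([]time.Duration, 0, len(reqs))
	failed := 0
	for _, r := range reqs {
		lat = append(lat, r.latency)
		if r.failed {
			failed++
		}
	}
	sort.Slice(lat, func(i, j int) bool { return lat[i] < lat[j] })
	rank := (99*len(lat) + 99) / 100 // ceil(0.99 * n), always >= 1 here
	return signals{
		p99Latency: lat[rank-1],
		perSecond:  float64(len(reqs)) / window.Seconds(),
		errorRate:  float64(failed) / float64(len(reqs)),
		saturation: float64(inFlight) / float64(capacity),
	}
}

func main() {
	reqs := []request{
		{10 * time.Millisecond, false},
		{20 * time.Millisecond, false},
		{40 * time.Millisecond, true},
		{30 * time.Millisecond, false},
	}
	fmt.Printf("%+v\n", summarize(reqs, 2*time.Second, 5, 10))
}
```

In a Prometheus setup these values would normally be derived at query time from counters and histograms rather than computed in the service; the sketch only makes the definitions concrete.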

SLIDE 16

Utilization
Saturation
Error rate

SLIDE 17

“USE metrics often allow users to solve 80% of server issues with 5% of the effort.”

SLIDE 18

the plan:

✔ observability DigitalOcean
✔ build --- instrument --- test --- iterate
examples

SLIDE 19

build:

design the service
write it in go
use internally shared libraries

SLIDE 20

build: doge/dorpc - shared rpc library

var DefaultInterceptors = []string{
    StdLoggingInterceptor,
    StdMetricsInterceptor,
    StdTracingInterceptor,
}

func NewServer(opt ...ServerOpt) (*Server, error) {
    opts := serverOpts{
        name:             "server",
        clientTLSAuth:    tls.VerifyClientCertIfGiven,
        intercept:        interceptor.NewPathInterceptor(interceptor.DefaultInterceptors...),
        keepAliveParams:  DefaultServerKeepAlive,
        keepAliveEnforce: DefaultServerKeepAliveEnforcement,
    }
    …
}

SLIDE 21

instrument:

send logs to centralized logging
send spans to trace-collectors
set up prometheus metrics

SLIDE 22

metrics instrumentation: go-client

func (s *server) initalizeMetrics() {
    s.metrics = metricsConfig{
        attemptedConvergeChassis: s.metricsNode.Gauge("attempted_converge_chassis", "number of chassis converger attempting to converge"),
        failedConvergeChassis:    s.metricsNode.Gauge("failed_converge_chassis", "number of chassis that failed to converge"),
    }
}

func (s *server) ConvergeAllChassis(...) {
    ...
    s.metrics.attemptedConvergeChassis(float64(len(attempted)))
    s.metrics.failedConvergeChassis(float64(len(failed)))
    ...
}

SLIDE 23

Quick Q & A: Collector Interface

// A collector must be registered.
prometheus.MustRegister(collector)

type Collector interface {
    // Describe sends descriptors to channel.
    Describe(chan<- *Desc)

    // Collect is used by the prometheus registry on a scrape.
    // Metrics are sent to the provided channel.
    Collect(chan<- Metric)
}

SLIDE 24

metrics instrumentation: third-party exporters

Built using the collector interface
Sometimes we build our own
Often we use others:

github.com/prometheus/mysqld_exporter
github.com/kbudde/rabbitmq_exporter
github.com/prometheus/node_exporter
github.com/digitalocean/openvswitch_exporter

SLIDE 25

metrics instrumentation: in-service collectors

type RateMap struct {
    mu sync.Mutex
    ...
    rateMap map[string]*rate
}

var _ prometheus.Collector = &RateMapCollector{}

func (r *RateMapCollector) Describe(ch chan<- *prometheus.Desc) {
    ds := []*prometheus.Desc{
        r.RequestRate,
    }
    for _, d := range ds {
        ch <- d
    }
}

func (r *RateMapCollector) Collect(ch chan<- prometheus.Metric) {
    ...
    ch <- prometheus.MustNewConstHistogram(
        r.RequestRate, count, sum, rateCount)
}

SLIDE 26

metrics instrumentation: dashboard #1

state metrics

SLIDE 27

metrics instrumentation: dashboard #2

request rate
request latency

SLIDE 28

metrics instrumentation: dashboard #3

utilization metrics

SLIDE 29

metrics instrumentation: dashboard #4

queries/second
utilization

SLIDE 30

metrics instrumentation: dashboards #5 and #6

saturation metric

SLIDE 31

test:

load testing:

grpc-clients and goroutines

chaos testing:

take down a component of a system

integration testing:

how does this feature integrate with the cloud?
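
A load tester built from clients and goroutines can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the tool used in the talk: `call` stands in for a gRPC client call, and `runLoad`, `result`, and `failEvery` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// call stands in for one RPC issued by a load-test client (hypothetical).
type call func(i int) error

// result summarizes a load-test run.
type result struct {
	total  int64
	failed int64
}

// runLoad starts `clients` goroutines that each issue `perClient` calls and
// returns how many calls were made and how many returned an error.
func runLoad(clients, perClient int, do call) result {
	var res result
	var wg sync.WaitGroup
	for c := 0; c < clients; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perClient; i++ {
				atomic.AddInt64(&res.total, 1)
				if err := do(i); err != nil {
					atomic.AddInt64(&res.failed, 1)
				}
			}
		}()
	}
	wg.Wait()
	return res
}

// failEvery returns a fake call that fails whenever i is a multiple of n.
func failEvery(n int) call {
	return func(i int) error {
		if i%n == 0 {
			return errors.New("simulated failure")
		}
		return nil
	}
}

func main() {
	r := runLoad(4, 25, failEvery(5))
	fmt.Printf("total=%d failed=%d\n", r.total, r.failed)
}
```

Swapping `failEvery` for a function that calls a real client turns the same loop into a basic load generator; the service's own metrics then show how latency and error rate respond.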

SLIDE 32

testing: identify key issues

how is our latency?
is there a goroutine leak?
does resource usage increase with traffic?
is there a high error rate?
how are our third-party services?

use tracing to dig down
use cpu and memory profiling
use a worker pool
check logs for types of error

SLIDE 33

testing: tune metrics + alerts

do we need more labels for our metrics?
should we collect more data?
State-based alerting: Is our service up or down?
Threshold alerting: When does our service fail?

SLIDE 34

testing: documentation

set up operational playbooks
document recovery efforts

SLIDE 35

iterate! (but really, let’s look at some examples…)

SLIDE 36

the plan:

✔ observability DigitalOcean
✔ build --- instrument --- test --- iterate
✔ examples

SLIDE 37

product #1: DHCP

(hvaddrd)

SLIDE 38

product #1: DHCP

[Architecture diagram. Labels: Hypervisor, hvflowd, hvaddrd, OvS br0, RNS, addr0, tapX, dropletX, bolt, main, DHCPv4, DHCPv6, NDP, gRPC, OpenFlow, SetParameters, AddFlows, hvaddrd traffic]

SLIDE 39

DHCP: load testing

SLIDE 40

DHCP: load testing (2)

SLIDE 41

DHCP: custom conn collector

package dhcp4conn

var _ prometheus.Collector = &collector{}

// A collector gathers connection metrics.
type collector struct {
    ReadBytesTotal    *prometheus.Desc
    ReadPacketsTotal  *prometheus.Desc
    WriteBytesTotal   *prometheus.Desc
    WritePacketsTotal *prometheus.Desc
}

Implements the net.Conn interface and allows us to process ethernet frames for validation and other purposes.
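
One way such connection counters could be gathered is by wrapping a net.Conn and counting on every Read and Write. The sketch below uses only the standard library and is an assumption about the general approach, not the dhcp4conn implementation: `countingConn`, its counter fields, and `pipeDemo` are hypothetical names. A collector like the one on the slide could then read these counters on each scrape.

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

// countingConn wraps a net.Conn and counts traffic passing through it.
// The counter names are hypothetical, loosely mirroring the slide's Descs.
type countingConn struct {
	readBytes, readCalls   uint64
	writeBytes, writeCalls uint64
	net.Conn
}

// Read forwards to the wrapped connection and records the bytes read.
func (c *countingConn) Read(p []byte) (int, error) {
	n, err := c.Conn.Read(p)
	atomic.AddUint64(&c.readBytes, uint64(n))
	atomic.AddUint64(&c.readCalls, 1)
	return n, err
}

// Write forwards to the wrapped connection and records the bytes written.
func (c *countingConn) Write(p []byte) (int, error) {
	n, err := c.Conn.Write(p)
	atomic.AddUint64(&c.writeBytes, uint64(n))
	atomic.AddUint64(&c.writeCalls, 1)
	return n, err
}

// pipeDemo moves 5 bytes in and 3 bytes out through a wrapped in-memory
// pipe and returns the resulting byte counters.
func pipeDemo() (readBytes, writeBytes uint64) {
	a, b := net.Pipe()
	defer a.Close()
	defer b.Close()
	cc := &countingConn{Conn: a}

	// The peer writes 5 bytes, which the wrapped side reads.
	go b.Write([]byte("hello"))
	cc.Read(make([]byte, 16))

	// The wrapped side writes 3 bytes, which the peer reads.
	done := make(chan struct{})
	go func() {
		b.Read(make([]byte, 16))
		close(done)
	}()
	cc.Write([]byte("hi!"))
	<-done

	return atomic.LoadUint64(&cc.readBytes), atomic.LoadUint64(&cc.writeBytes)
}

func main() {
	r, w := pipeDemo()
	fmt.Printf("read=%d bytes, wrote=%d bytes\n", r, w)
}
```

Because the wrapper embeds net.Conn, every other method (Close, deadlines, addresses) passes through unchanged, so it can be dropped in wherever the plain connection was used.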

SLIDE 42

DHCP: custom conn collector

SLIDE 43

DHCP: goroutine worker pools

workC := make(chan request, Workers)

for i := 0; i < Workers; i++ {
    go func() {
        defer workWG.Done()
        for r := range workC {
            s.serve(r.buf, r.from)
        }
    }()
}

Uses a buffered channel to process requests, limiting goroutines and resource usage.
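
The slide shows only the worker loop. A self-contained version of the same pattern, including the WaitGroup setup and channel shutdown that the slide leaves out, might look like the sketch below. The `request` fields follow the slide; `servePool` and its `serve` callback are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"sync"
)

// request mirrors the fields used on the slide; its real definition is not shown there.
type request struct {
	buf  []byte
	from string
}

// servePool processes reqs with a fixed number of worker goroutines and
// returns once every request has been served.
func servePool(workers int, reqs []request, serve func(request)) {
	workC := make(chan request, workers)

	var workWG sync.WaitGroup
	workWG.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer workWG.Done()
			for r := range workC {
				serve(r)
			}
		}()
	}

	for _, r := range reqs {
		workC <- r // blocks once the buffer is full and all workers are busy
	}
	close(workC) // ends the workers' range loops
	workWG.Wait()
}

func main() {
	reqs := make([]request, 10)
	served := make(chan string, len(reqs))
	servePool(3, reqs, func(r request) { served <- r.from })
	fmt.Println("served:", len(served))
}
```

The number of goroutines stays fixed at `workers` no matter how many requests arrive, which is what bounds resource usage under load.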

SLIDE 44

DHCP: rate limiter collector

type RateMap struct {
    mu sync.Mutex
    ...
    rateMap map[string]*rate
}

type RateMapCollector struct {
    RequestRate *prometheus.Desc
    rm          *RateMap
    buckets     []float64
}

func (r *RateMapCollector) Collect(ch chan<- prometheus.Metric) {
    …
    ch <- prometheus.MustNewConstHistogram(
        r.RequestRate, count, sum, rateCount)
}

ratemap calculates the exponentially weighted moving average on a per-client basis and limits requests
collector gives us a snapshot of rate distributions
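
A per-client EWMA rate limiter and a bucketed snapshot of its rates can be sketched as follows. This is a guess at the general shape, not the hvaddrd code: the `alpha` and `limit` fields, the `Allow`, `Snapshot`, and `simulate` functions, and the choice of 1/Δt as the instantaneous rate are all assumptions. The snapshot counts are cumulative per upper bound, which is the form Prometheus histogram buckets take.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// rate tracks one client's smoothed request rate in requests/second.
type rate struct {
	ewma float64
	last time.Time
}

// RateMap limits clients by EWMA request rate. alpha and limit are
// hypothetical knobs; the slide elides the real fields.
type RateMap struct {
	mu      sync.Mutex
	alpha   float64 // weight of the newest observation, 0 < alpha <= 1
	limit   float64 // maximum allowed smoothed rate
	rateMap map[string]*rate
}

func NewRateMap(alpha, limit float64) *RateMap {
	return &RateMap{alpha: alpha, limit: limit, rateMap: make(map[string]*rate)}
}

// Allow records a request from client at time now and reports whether the
// client's smoothed rate is still within the limit.
func (m *RateMap) Allow(client string, now time.Time) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	r, ok := m.rateMap[client]
	if !ok {
		m.rateMap[client] = &rate{last: now}
		return true
	}
	dt := now.Sub(r.last).Seconds()
	if dt <= 0 {
		dt = 1e-9 // guard against division by zero
	}
	r.ewma = m.alpha*(1/dt) + (1-m.alpha)*r.ewma
	r.last = now
	return r.ewma <= m.limit
}

// Snapshot returns, for each upper bound, how many clients currently have a
// smoothed rate at or below it (cumulative counts).
func (m *RateMap) Snapshot(upper []float64) []int {
	m.mu.Lock()
	defer m.mu.Unlock()
	counts := make([]int, len(upper))
	for _, r := range m.rateMap {
		for i, u := range upper {
			if r.ewma <= u {
				counts[i]++
			}
		}
	}
	return counts
}

// simulate sends n requests spaced gap apart and returns how many were allowed.
func simulate(m *RateMap, client string, n int, gap time.Duration) int {
	allowed := 0
	now := time.Unix(0, 0)
	for i := 0; i < n; i++ {
		if m.Allow(client, now) {
			allowed++
		}
		now = now.Add(gap)
	}
	return allowed
}

func main() {
	m := NewRateMap(0.5, 10)
	fmt.Println("slow client allowed:", simulate(m, "slow", 5, time.Second))
	fmt.Println("fast client allowed:", simulate(m, "fast", 5, 10*time.Millisecond))
	fmt.Println("clients per bucket:", m.Snapshot([]float64{1, 10, 100}))
}
```

Passing the timestamp into `Allow` rather than calling time.Now inside it keeps the limiter deterministic under test.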

SLIDE 45

DHCP: rate alerts

[Diagram: Rate Limiter emits log line -> Centralized Logging -> Elastalert]

SLIDE 46

DHCP: the final result

SLIDE 47

product #2: VPC

SLIDE 48

product #2: VPC

SLIDE 49

VPC: load-testing

load tester repeatedly makes some RPC calls

SLIDE 50

VPC: latency issues (1)

as load testing continued, we started to notice latency in different rpc calls

SLIDE 51

VPC: latency issues (2)

use tracing to take a look at the /SyncInitialChassis call

SLIDE 52

VPC: latency issues (3)

Note that spans for some traces were being dropped. Slowing down the load tester, however, eventually ameliorated that problem.

SLIDE 53

VPC: latency issues (4)

“The fix was to be smarter and do the queries more efficiently. The repetitive loop of queries to rnsdb really stood out in the lightstep data.”
- Bob Salmi

SLIDE 54

VPC: remove component

can the queue be replaced with a simple request-response system?

source: https://programmingisterrible.com/post/162346490883/how-do-you-cut-a-monolith-in-half

SLIDE 55

VPC: chaos testing

Induce south service failure and see how rabbit responds
Drop primary and recover from secondary
Induce northd failure and ensure failover works

SLIDE 56

VPC: add alerts (1)

state-based alerts

SLIDE 57

VPC: add alerts (2)

threshold alert

SLIDE 58

conclusion

SLIDE 59

what?

four golden signals, USE metrics

when?

as early as possible

how?

combine with profiling, logging, tracing

SLIDE 60

thanks!

@snehainguva