Tracing polyglot systems: An OpenTracing Tutorial (Yuri Shkuro)



slide-1
SLIDE 1

Tracing polyglot systems

An OpenTracing Tutorial

Yuri Shkuro (Uber), Won Jun Jang (Uber), Prithvi Raj (Uber) Velocity NYC, Oct 1 2018

1

slide-2
SLIDE 2

Agenda -- http://bit.do/velocity18

  • 9:00 - 9:15 Introductions
  • 9:15 - 9:45 (talk) Introduction to Distributed Tracing
  • 9:45 - 10:00 Q & A
  • 10:00 - 10:30 Tutorials
  • 10:30 - 11:00 Break
  • 11:00 - 11:15 Part 2: Q & A
    ○ How far did you get?
    ○ Any questions about the OpenTracing API?

  • 11:15 - 11:45 Tutorials (continued)
  • 11:45 - 12:00 (talk) Deploying and Using Tracing in Your Organization
  • 12:00 - 12:30 Group discussion / unconference

2

slide-3
SLIDE 3

Getting the most out of this workshop

3

  • Learn the ropes.
  • If you already know them, help teach ‘em ropes :)
  • Meet some people

Everyone can walk away with practical tracing experience and a better sense of the space.

slide-4
SLIDE 4

Intros

4

  • Which company / organization are you from?
  • How big is your architecture?
  • What monitoring challenges do you have?
slide-5
SLIDE 5

Why care about Tracing

Tracing is fun

5

slide-6
SLIDE 6

6

Modern applications are very complex. Thanks, microservices!

slide-7
SLIDE 7

7

BILLIONS times a day!

slide-8
SLIDE 8

8

How do we know what’s going on?

slide-9
SLIDE 9

We use MONITORING tools

Metrics / Stats

  • Counters, timers, gauges, histograms
  • Four golden signals
    ○ utilization
    ○ saturation
    ○ throughput
    ○ errors
  • Statsd, Prometheus, Grafana

Logging

  • Application events
  • Errors, stack traces
  • ELK, Splunk, Fluentd

Monitoring tools must “tell stories” about your system

slide-10
SLIDE 10

Metrics and logs are per-instance. They don’t tell the full story. We need to understand distributed transactions

Metrics and logs don’t cut it anymore!

10

slide-11
SLIDE 11

Systems are Distributed and Concurrent

11

[Figure panels: “The Simple [Inefficient] Thing”, Basic Concurrency, Async Concurrency, Distributed Concurrency]

slide-12
SLIDE 12

12

How do we “tell stories” about distributed concurrency?

slide-13
SLIDE 13

13

slide-14
SLIDE 14

Distributed Tracing in a Nutshell

14

[Figure: the edge service assigns a unique ID to the request and passes it as {context} on every call between services A, B, C, D, and E. The resulting TRACE is drawn as SPANS A, B, E, C, D along a time axis.]

slide-15
SLIDE 15

Let’s look at some traces

demo time: http://bit.do/jaeger-hotrod

15

slide-16
SLIDE 16

16

Distributed Tracing Systems

  • performance and latency optimization
  • distributed transaction monitoring
  • service dependency analysis
  • root cause analysis
  • distributed context propagation

slide-17
SLIDE 17

Great… Why isn’t everyone tracing?

Tracing instrumentation has been too hard, with no standardization.

17

slide-18
SLIDE 18

How are applications instrumented?

18

[Figure labels: Application (automatically instrumented); Application; Application (manually instrumented); Agent for automatic instrumentation; Manually instrumented frameworks; Open Source Instrumentation API; Tracing library implementation; Tracing system / analytics backend]

slide-19
SLIDE 19

A Bigger Picture

19

  • Your Service
  • Not Your Service (Spanner, S3, Kinesis, etc.)
  • Your Tracing System (Jaeger, Zipkin)
  • Not Your Tracing System (StackDriver, XRay)

[Figure labels: Tracing API, context, Shared Libraries, Tracer, Trace Data. Stages: Describing, Correlating, Recording, Analyzing, and Federating Transactions.]

slide-20
SLIDE 20

What is OpenTracing

http://opentracing.io

20

slide-21
SLIDE 21

OpenTracing Mission

Provide an API for describing distributed transactions Unlock open source, vendor-neutral instrumentation

21

slide-22
SLIDE 22

OpenTracing Goals

  • Zero-dependencies, pure API for describing the shape, timing, and metadata about distributed transactions. Vendor neutral. Data formats agnostic.
  • API primitives for intra-process and inter-process propagation of context, including general purpose, transaction-scoped “baggage”.
  • A body of reusable, vendor-neutral, open source instrumentation for existing systems, libraries, and frameworks, and/or enable them to include instrumentation built-in.
  • Semantic conventions for standardized data elements (for tags and log fields) for describing metadata of common operations, such as http or database calls.

22

slide-23
SLIDE 23

Who should care?

Developers building:

  • Cloud-native / microservices-based applications
  • OSS packages, especially near process edges (web frameworks, managed service clients, etc)
  • Tracing and/or monitoring systems

23

slide-24
SLIDE 24

OpenTracing Architecture

24

[Figure: inside a microservice process, main(), application logic, µ-service frameworks, Lambda functions, RPC & control-flow frameworks, and existing instrumentation all sit on top of the OpenTracing API, which connects to the tracing infrastructure (e.g. Instana, CNCF Jaeger).]

slide-25
SLIDE 25

A young, growing project

  • 2.5 years old (https://opentracing.devstats.cncf.io)
  • Tracer implementations: Jaeger, Zipkin, LightStep, SkyWalking, others
  • All sorts of companies use OpenTracing

25

slide-26
SLIDE 26

Rapidly growing OSS and vendor support

26

JDBI

Java Webservlet

Jaxr

slide-27
SLIDE 27

Jaeger

A distributed tracing system

27

slide-28
SLIDE 28

Jaeger - /ˈyāɡər/, noun: hunter

  • Inspired by Google’s Dapper and OpenZipkin
  • Started at Uber in August 2015
  • Open sourced in April 2017
  • Official CNCF project since Sep 2017
  • Built-in OpenTracing support
  • https://jaegertracing.io

28

slide-29
SLIDE 29

Jaeger Technology Stack

  • Backend components in Go
  • Pluggable storage

○ Cassandra, Elasticsearch, memory, ...

  • Web UI in React/Javascript
  • OpenTracing instrumentation libraries

29

slide-30
SLIDE 30

Jaeger: Community

  • Several full time engineers at Uber and Red Hat
  • Over 600 contributors on GitHub (stats)
  • Blog: https://medium.com/jaegertracing
  • Chat: https://gitter.im/jaegertracing/Lobby
  • Twitter: https://twitter.com/JaegerTracing

30

slide-31
SLIDE 31

Doc http://bit.do/velocity18

OpenTracing deep dive

31

slide-32
SLIDE 32

Materials

  • Setup instructions: http://bit.do/velocity18
  • Tutorial: http://bit.do/opentracing-tutorial
  • Q&A: https://gitter.im/opentracing/workshop

32

slide-33
SLIDE 33

33

Lesson 1 Hello, World

slide-34
SLIDE 34

Lesson 1 Objectives

34

  • Basic concepts
  • Instantiate a Tracer
  • Create a simple trace
  • Annotate the trace
slide-35
SLIDE 35

Basic concepts: SPAN

Span: a basic unit of work, timing, and causality. A span contains:

  • operation name
  • start / finish timestamps
  • tags and logs
  • references to other spans

35

slide-36
SLIDE 36

Basic concepts: TRACE

Trace: a directed acyclic graph (DAG) of spans

36

Span A Span B Span C Span D Span E Span F Span G Span H

slide-37
SLIDE 37

Trace as a time sequence diagram

37

A B E C D time F G H

slide-38
SLIDE 38

Basic concepts: OPERATION NAME

38

A human-readable string which concisely represents the work of the span.

  • E.g. an RPC method name, a function name, or the name of a subtask or stage within a larger computation
  • Can be set at span creation or later
  • Should be low cardinality, aggregatable, identifying a class of spans

Examples:

  • get: too general
  • get_account/12345: too specific
  • get_account: good, “12345” could be a tag
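
One way to keep operation names low cardinality is to move the variable part into a tag before the span is created. The sketch below is only an illustration: splitOperation is a made-up helper, and the "account_id" tag key is an assumption, not a standard OpenTracing tag.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOperation separates the low-cardinality operation name from the
// high-cardinality identifier that follows the first "/".
// The "account_id" key is a hypothetical tag name chosen for this example.
func splitOperation(raw string) (string, map[string]string) {
	tags := map[string]string{}
	i := strings.IndexByte(raw, '/')
	if i < 0 {
		return raw, tags // nothing to split off
	}
	if id := raw[i+1:]; id != "" {
		tags["account_id"] = id
	}
	return raw[:i], tags
}

func main() {
	op, tags := splitOperation("get_account/12345")
	fmt.Println(op, tags["account_id"]) // prints: get_account 12345
}
```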

slide-39
SLIDE 39

Basic concepts: TAG

A key-value pair that describes the span overall. Examples:

  • http.url = “http://google.com”
  • http.status_code = 200
  • peer.service = “mysql”
  • db.statement = “select * from users”

https://github.com/opentracing/specification/blob/master/semantic_conventions.md

39

slide-40
SLIDE 40

Basic concepts: LOG

40

Describes an event at a point in time during the span lifetime.

  • OpenTracing supports structured logging
  • Contains a timestamp and a set of fields

span.log_kv( {'event': 'open_conn', 'port': 433} )

slide-41
SLIDE 41

Basic concepts: TRACER

A tracer is a concrete implementation of the OpenTracing API.

    tracer := jaeger.New("hello-world")
    span := tracer.StartSpan("say-hello")
    // do the work
    span.Finish()

41

slide-42
SLIDE 42

Understanding Sampling

  • Tracing data volume can exceed business traffic
  • Most tracing systems sample transactions
  • Head-based sampling: the sampling decision is made just before the trace is started, and it is respected by all nodes in the graph
  • Tail-based sampling: the sampling decision is made after the trace is completed / collected
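
Head-based sampling can be sketched in a few lines: the decision is taken once at the root and stored in the context flags, and every downstream span copies it. The types and functions below (toyContext, startTrace, childOf) are hypothetical, and deriving the decision from the trace ID is just one possible policy, assumed here so the result is reproducible.

```go
package main

import "fmt"

const flagSampled byte = 1

// toyContext is a hypothetical stand-in for a span context.
type toyContext struct {
	traceID uint64
	flags   byte
}

// startTrace makes the sampling decision once, when the trace begins.
// rate is the fraction of traces to keep, between 0 and 1.
func startTrace(traceID uint64, rate float64) toyContext {
	ctx := toyContext{traceID: traceID}
	boundary := uint64(rate * float64(1<<63))
	if traceID>>1 < boundary { // compare on 63 bits to stay within uint64 range
		ctx.flags |= flagSampled
	}
	return ctx
}

// childOf copies the decision; downstream nodes never decide again.
func childOf(parent toyContext) toyContext {
	return toyContext{traceID: parent.traceID, flags: parent.flags}
}

func (c toyContext) sampled() bool { return c.flags&flagSampled != 0 }

func main() {
	root := startTrace(42, 1.0) // rate 1.0 keeps every trace
	leaf := childOf(childOf(root))
	fmt.Println(root.sampled(), leaf.sampled()) // prints: true true
	fmt.Println(startTrace(42, 0).sampled())    // prints: false
}
```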

42

slide-43
SLIDE 43

How to create Jaeger Tracer

43

    cfg := &config.Configuration{
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{LogSpans: true},
    }
    tracer, closer, err := cfg.New(serviceName)

slide-44
SLIDE 44

44

Lesson 2 Context and Tracing Functions

slide-45
SLIDE 45

Lesson 2 Objectives

45

  • Trace individual functions
  • Combine multiple spans into a single trace
  • Propagate the in-process context
slide-46
SLIDE 46

46

How do we build a DAG?

    span1 := tracer.StartSpan("say-hello")
    // do the work
    span1.Finish()

    span2 := tracer.StartSpan("format-string")
    // do the work
    span2.Finish()

This just creates two independent traces!

slide-47
SLIDE 47

47

Build a DAG with Span References

    span1 := tracer.StartSpan("say-hello")
    // do the work
    span1.Finish()

    span2 := tracer.StartSpan(
        "format-string",
        opentracing.ChildOf(span1.Context()),
    )
    // do the work
    span2.Finish()
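
Why the reference matters can be shown with a toy tracer: a span started without a parent gets a fresh trace ID, while a span started with a parent context joins the parent's trace. toyTracer and toyContext are hypothetical types written for this sketch, not part of any tracing library.

```go
package main

import "fmt"

// toyContext identifies a span and the trace it belongs to.
type toyContext struct{ traceID, spanID uint64 }

// toyTracer hands out sequential IDs (real tracers use random IDs).
type toyTracer struct{ nextID uint64 }

// startSpan starts a new trace unless a parent context is supplied.
func (t *toyTracer) startSpan(op string, parent *toyContext) toyContext {
	t.nextID++
	id := t.nextID
	if parent == nil {
		return toyContext{traceID: id, spanID: id} // root of a new trace
	}
	return toyContext{traceID: parent.traceID, spanID: id}
}

func main() {
	t := &toyTracer{}
	span1 := t.startSpan("say-hello", nil)
	orphan := t.startSpan("format-string", nil)   // no reference
	child := t.startSpan("format-string", &span1) // reference to span1
	fmt.Println(span1.traceID == orphan.traceID, span1.traceID == child.traceID)
	// prints: false true
}
```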

slide-48
SLIDE 48

Basic concepts: SPAN CONTEXT

48

Serializable format for linking spans across network boundaries. Carries trace/span identity and baggage.

    type SpanContext struct {
        traceID  TraceID
        spanID   SpanID
        parentID SpanID
        flags    byte
        baggage  map[string]string
    }

slide-49
SLIDE 49

Basic concepts: SPAN REFERENCE

Describes causal relationship to another span.

    type Reference struct {
        Type    opentracing.SpanReferenceType
        Context SpanContext
    }

49

slide-50
SLIDE 50

Types of Span References

  • ChildOf: referenced span is an ancestor that depends on the results of the current span. E.g. RPC call, database call, local function.
  • FollowsFrom: referenced span is an ancestor that does not depend on the results of the current span. E.g. async fire-n-forget cache write.

50

slide-51
SLIDE 51

In-process Context Propagation

We don’t want to keep passing Spans around. Need a more general request context:

  • Go: context.Context (from std lib)
  • Java, Python: Scope & Scope Manager (thread-locals)
  • Node.js: TBD (internally: @uber/node-context)

51

slide-52
SLIDE 52

52

Lesson 3 Tracing RPC Requests

slide-53
SLIDE 53

Lesson 3 Objectives

  • Trace a transaction across more than one microservice
  • Pass the context between processes using Inject and Extract
  • Apply OpenTracing-recommended tags

53

slide-54
SLIDE 54

Anatomy of Tracing Instrumentation

54

[Figure: MY SERVICE sits between an inbound request and an outbound request. (1) Handler instrumentation: Headers, TraceID, Span Context, Span. (2) The Span is carried in-process inside the Context. (3) Client instrumentation: Span, TraceID, Headers. The tracer library sends trace data to the tracing backend on a background thread.]

slide-55
SLIDE 55

Basic concepts: Inject and Extract

Tracer methods used to serialize Span Context to or from RPC requests (or other network comms):

    void Inject(SpanContext, Format, Carrier)
    SpanContext Extract(Format, Carrier)

55

slide-56
SLIDE 56

Basic concepts: Propagation Format

OpenTracing does not define the wire format. It assumes that the frameworks for network comms allow passing the context (request metadata) as one of these (the Format enum):

  • 1. TextMap: Arbitrary string key/value headers
  • 2. Binary: A binary blob
  • 3. HTTPHeaders: as a special case of #1

56

slide-57
SLIDE 57

Basic concepts: Carrier

Each Format defines a corresponding Carrier interface that the Tracer uses to read/write the span context. The instrumentation implements the Carrier interface as an adapter around their custom types

57

slide-58
SLIDE 58

Inject Example

58

[Figure: the Tracer writes to a TextMap Carrier via Set(key, value) or to a Binary Carrier via Write(byte[]). An Adapter maps each call onto the RPC Request, e.g. AddHeader(key, value) or Write(byte[]).]

slide-59
SLIDE 59

59

Lesson 4 Baggage

slide-60
SLIDE 60

Lesson 4 Objectives

  • Understand distributed context propagation
  • Use baggage to pass data through the call graph

60

slide-61
SLIDE 61

Distributed Context Propagation

61

  • Client Span: button=buy
  • Frontend Span: button=buy, exp_id=57
  • Ad Span: button=buy, exp_id=57
  • Content Span: button=buy, exp_id=57
  • Shard A Span: button=buy, exp_id=57
  • Shard B Span: button=buy, exp_id=57
  • Cassandra Spans (×5): button=buy, exp_id=57

Problem: how to aggregate disk writes in Cassandra by “button” type (or experiment id, etc, etc)?

See the Pivot Tracing paper http://pivottracing.io/

slide-62
SLIDE 62

Basic concepts: Baggage

Baggage is a general purpose in-band key-value store.

    span.SetBaggageItem("Bender", "Rodriguez")

Transparent to most services. Powerful but dangerous:

  • Bloats the request size
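
A toy model shows both sides of baggage: items set near the edge reach every downstream service, and every hop pays for them in request size. toyContext and its methods are hypothetical; the keys reuse the button / exp_id example from the earlier slide, and wireSize is a rough count of key and value bytes, not a real wire format.

```go
package main

import "fmt"

// toyContext is a hypothetical span context that carries baggage.
type toyContext struct {
	service string
	baggage map[string]string
}

// withBaggageItem returns a copy with one more item; the receiver is untouched.
func (c toyContext) withBaggageItem(k, v string) toyContext {
	next := toyContext{service: c.service, baggage: map[string]string{k: v}}
	for key, val := range c.baggage {
		if key != k {
			next.baggage[key] = val
		}
	}
	return next
}

// callDownstream models an RPC: the callee inherits all baggage.
func (c toyContext) callDownstream(service string) toyContext {
	child := toyContext{service: service, baggage: map[string]string{}}
	for k, v := range c.baggage {
		child.baggage[k] = v
	}
	return child
}

// wireSize approximates the bytes baggage adds to each request.
func (c toyContext) wireSize() int {
	n := 0
	for k, v := range c.baggage {
		n += len(k) + len(v)
	}
	return n
}

func main() {
	client := toyContext{service: "client"}.withBaggageItem("button", "buy")
	frontend := client.callDownstream("frontend").withBaggageItem("exp_id", "57")
	cassandra := frontend.callDownstream("shard-a").callDownstream("cassandra")
	fmt.Println(cassandra.baggage["button"], cassandra.baggage["exp_id"], cassandra.wireSize())
	// prints: buy 57 17
}
```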

slide-63
SLIDE 63

Extra Credit

63

slide-64
SLIDE 64

64

Logging v. Tracing

slide-65
SLIDE 65

Monitoring == Observing Events

65

  • Metrics: record events as aggregates (e.g. counters)
  • Tracing: record transaction-scoped events
  • Logging: record unique events

[Axis: low volume to high volume]

slide-66
SLIDE 66

Logging v. Tracing

66

Tracing

  • Contextual
  • High granularity (debug and ↓)
  • Per-transaction sampling
  • Lower volume, higher fidelity

Logging

  • No context
  • Low granularity (warn and ↑)
  • Per-process sampling (at best)
  • High volume, low fidelity

Industry advice: don’t log on success (https://vimeo.com/221066726)

slide-67
SLIDE 67

Deploying and Using Tracing in Your Organization

Practitioner’s Advice

67

slide-68
SLIDE 68

Getting the Basics

You will need to:

  • 1. Identify relevant frameworks
  • 2. Install relevant OpenTracing plugins
  • 3. Instrument code with OpenTracing
  • 4. Pick a Tracer that matches your tracing backend

68

slide-69
SLIDE 69

Frameworks & Plugins

Modern systems are a mix of your application code, shared libraries, and shared infrastructure/resources. The opentracing-contrib project adds OT support to popular libraries and frameworks. Leveraging these plugins both expands OT coverage in your app and may reduce the required explicit code instrumentation.

69

slide-70
SLIDE 70

Integrate with your infra libraries early

70

  • Create adapters, do not use outside libraries directly
    ○ Adapters allow you to customize configuration
    ○ E.g. do not expect app developers to give your tracer the service name, get it from some environment variable instead
  • If your org is already using common infra libraries, e.g. for RPC, change them to include tracing by default
  • Tag logs with trace & span ID, people love it
slide-71
SLIDE 71

Decide What to Instrument

  • Identify a high-value business transaction

– E.g. “discover nearby x”, “add to cart”, etc.

  • Identify the points of ingress and egress
  • Breadth-first, not depth-first
  • Get the first end-to-end trace reported

71

slide-72
SLIDE 72

Evangelize

  • Give internal talks
  • Show people examples where tracing helps
  • Get management buy-in

72

slide-73
SLIDE 73

Distributed Context Propagation

  • Identifying synthetic traffic

– Use as a dimension for metrics

  • Product Tenancy

– E.g. top-level product: Docs, Gmail

  • Chaos engineering & fault injection

– Random killings must stop!

73

slide-74
SLIDE 74

Use Data Mining

  • Tracing data is extremely rich. Don’t let it go to waste by looking at individual traces only.
  • Create big data jobs to aggregate traces for meaningful insights, specific to your infrastructure
    – We write spans to Kafka and run Flink jobs
    – We also support Hive queries on HDFS

74

slide-75
SLIDE 75

In conclusion

Wow, what a great audience!

75

slide-76
SLIDE 76

Contributors are most welcome

76

http://opentracing.io http://jaegertracing.io

slide-77
SLIDE 77

Thank you & Happy Tracing!

  • Hope to see you in Shanghai or Seattle!

– Registration & Sponsorships now open: kubecon.io

  • KubeCon + CloudNativeCon China 2018

– November 13 – 15, 2018 | Shanghai, China

  • KubeCon + CloudNativeCon North America 2018

– December 11 – 13, 2018 | Seattle, WA

  • We (Uber) are hiring! https://uber.com/careers/

77

slide-78
SLIDE 78

Appendix

78

slide-79
SLIDE 79

Understanding Sampling

Tracing data can exceed business traffic. Most tracing systems sample transactions:

  • Head-based sampling: the sampling decision is made just before the trace is started, and it is respected by all nodes in the graph
  • Tail-based sampling: the sampling decision is made after the trace is completed / collected

79