Seer: Leveraging Big Data to Navigate The Increasing Complexity of Cloud Debugging

Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He and Christina Delimitrou. Cornell University. HotCloud, July 9th, 2018.


SLIDE 1

Seer: Leveraging Big Data to Navigate The Increasing Complexity of Cloud Debugging

Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He and Christina Delimitrou

Cornell University

HotCloud, July 9th, 2018

SLIDE 2

Executive Summary

• Microservices put more pressure on performance predictability
  ◦ Microservice dependencies → propagate & amplify QoS violations
  ◦ Finding the culprit of a QoS violation is difficult
  ◦ Post-QoS violation, returning to nominal operation is hard
• Anticipating QoS violations & identifying culprits
• Seer: Data-driven Performance Debugging for Microservices
  ◦ Combines lightweight RPC-level distributed tracing with hardware monitoring
  ◦ Leverages scalable deep learning to signal QoS violations with enough slack to apply corrective action

SLIDE 3

From Monoliths to Microservices

SLIDE 4

Motivation

• Advantages of microservices:
  ◦ Ease & speed of code development & deployment
  ◦ Security, error isolation
  ◦ PL/framework heterogeneity
• Challenges of microservices:
  ◦ Change server design assumptions
  ◦ Complicate resource management → dependencies
  ◦ Amplify tail-at-scale effects
  ◦ More sensitive to performance unpredictability
  ◦ No representative end-to-end apps with microservices

SLIDE 5

An End-to-End Suite for Cloud & IoT Microservices

• 4 end-to-end applications using popular open-source microservices → ~30-40 microservices per app
  ◦ Social Network
  ◦ Movie Reviewing/Renting/Streaming
  ◦ E-commerce
  ◦ Drone control service
• Programming languages and frameworks:
  ◦ node.js, Python, C/C++, Java/JavaScript, Scala, PHP, and Go
  ◦ Nginx, memcached, MongoDB, CockroachDB, Mahout, Xapian
  ◦ Apache Thrift RPC, RESTful APIs
  ◦ Docker containers
  ◦ Lightweight RPC-level distributed tracing

SLIDE 6

Resource Management Implications

• Challenges of microservices:
  ◦ Dependencies complicate resource management
  ◦ Dependencies change over time → difficult for users to express
  ◦ Amplify tail-at-scale effects

[Figure panels: Netflix, Twitter, Amazon, Movie Streaming]
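
The tail-amplification point can be illustrated with a back-of-the-envelope calculation. This is a minimal sketch under a simplifying assumption that is not in the slides: every microservice on a request's path is independently slow with the same probability. The function name and the 1% figure are illustrative only.

```python
def p_request_slow(p_service_slow: float, n_services: int) -> float:
    """Probability that at least one of n independent services is slow.

    Assumes independence and an identical per-service probability, which
    real deployments do not guarantee.
    """
    return 1.0 - (1.0 - p_service_slow) ** n_services


# Illustrative: 1% slow responses per service, 30 services on the path.
print(round(p_request_slow(0.01, 30), 3))
```

Under these assumptions, a 1% per-service slow rate becomes roughly a one-in-four chance that a request touching 30 services sees at least one slow hop.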

SLIDE 7

The Need for Proactive Performance Debugging

• Detecting QoS violations after they occur:
  ◦ Unpredictable performance propagates through system
  ◦ Long time until return to nominal operation
  ◦ Does not scale

SLIDE 8

Performance Implications

[Figure legend: CPU, Mem, Net, Disk, Queue]

SLIDE 9

Performance Implications

[Figure legend: CPU, Mem, Net, Disk, Queue]

SLIDE 10

Seer: Data-Driven Performance Debugging

• Leverage the massive amount of traces collected over time
  1. Apply online, practical data mining techniques that identify the culprit of an upcoming QoS violation
  2. Use per-server hardware monitoring to determine the cause of the QoS violation
  3. Take corrective action to prevent the QoS violation from occurring
• Need to predict hundreds of msec to a few sec into the future
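
The three steps on this slide can be sketched as one iteration of a control loop. This is only an illustration of the predict, diagnose, act ordering: `debug_step` and its three callbacks are hypothetical names, and the stand-in callbacks in the usage lines do not reflect how Seer implements any step.

```python
from typing import Callable, Optional, Sequence, Tuple

Trace = dict  # hypothetical: one recent tracing sample


def debug_step(
    recent_traces: Sequence[Trace],
    predict_culprit: Callable[[Sequence[Trace]], Optional[str]],
    diagnose_cause: Callable[[str], str],
    apply_action: Callable[[str, str], None],
) -> Optional[Tuple[str, str]]:
    """Run one predict -> diagnose -> act iteration."""
    # Step 1: which microservice, if any, is about to cause a violation?
    culprit = predict_culprit(recent_traces)
    if culprit is None:
        return None
    # Step 2: which resource is the likely cause on that service's server?
    cause = diagnose_cause(culprit)
    # Step 3: act before the violation materializes.
    apply_action(culprit, cause)
    return culprit, cause


# Usage with stand-in callbacks (all hypothetical).
actions = []
result = debug_step(
    recent_traces=[{"service": "memcached", "queue_depth": 42}],
    predict_culprit=lambda ts: max(ts, key=lambda t: t["queue_depth"])["service"],
    diagnose_cause=lambda service: "cache",
    apply_action=lambda service, cause: actions.append((service, cause)),
)
```

Passing the three stages in as callbacks keeps the ordering explicit while leaving each stage replaceable.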

SLIDE 11

Tracing Framework

• RPC-level tracing
• Based on Apache Thrift
• Timestamp start-end for each microservice
• Store in centralized DB (Cassandra)
• Record all requests → no sampling
• Overhead: <0.1% in throughput and <0.2% in tail latency

[Figure: tracing architecture. Labels: Tracing Collector, WebUI, Client (http), Cassandra, QueryEngine, microservices latency Gantt charts; zTracer attached to uService K and uService K+1 (TCP, Proc); timeline segments: RPC timeTX, RPC timeRX, TCP procTX, TCP procRX, App proc]
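
Per-microservice start and end timestamps are enough to derive the latency Gantt chart rows mentioned in the figure. The sketch below assumes a hypothetical `Span` record with microsecond timestamps. It is not the actual schema used by the tracing framework or its Cassandra store.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical trace record: one microservice's share of one request."""
    trace_id: str
    service: str
    start_us: int
    end_us: int

    @property
    def duration_us(self) -> int:
        return self.end_us - self.start_us


def gantt_rows(spans: Sequence[Span], trace_id: str) -> List[Tuple[str, int, int]]:
    """Return (service, offset_us, duration_us) rows for one request,
    ordered by start time, with offsets relative to the earliest span."""
    rows = sorted((s for s in spans if s.trace_id == trace_id),
                  key=lambda s: s.start_us)
    if not rows:
        return []
    t0 = rows[0].start_us
    return [(s.service, s.start_us - t0, s.duration_us) for s in rows]


# Illustrative spans, not measured data.
spans = [
    Span("r1", "nginx", 100, 900),
    Span("r1", "memcached", 250, 400),
    Span("r2", "nginx", 0, 10),
]
rows = gantt_rows(spans, "r1")
```

Each row gives the offset and length of one bar, which is all a Gantt view needs.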

SLIDE 12

Deep Learning to the Rescue

• Why?
  ◦ Architecture-agnostic
  ◦ Adjusts to changes in dependencies over time
  ◦ High accuracy, good scalability
  ◦ Inference within the required window

SLIDE 13

DNN Configuration

Input signal
• Container utilization
• Latency
• Queue depth

Output signal
• Which microservice will cause a QoS violation in the near future?

SLIDE 14

DNN Configuration

Input signal
• Container utilization
• Latency
• Queue depth

Output signal
• Which microservice will cause a QoS violation in the near future?
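
The input/output framing on this slide can be sketched as a classifier that maps one value per microservice to a probability per microservice. The network below (one ReLU layer followed by a softmax) and its hand-set weights are assumptions chosen to keep the example small. The slides do not specify Seer's architecture at this level.

```python
import math
from typing import List, Sequence, Tuple


def dense(x: Sequence[float], weights: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v: Sequence[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_culprit(x, w1, b1, w2, b2, services) -> Tuple[str, List[float]]:
    """Return the most likely culprit and the per-service probabilities."""
    probs = softmax(dense(relu(dense(x, w1, b1)), w2, b2))
    best = max(range(len(probs)), key=probs.__getitem__)
    return services[best], probs


# Hypothetical services, identity weights, and normalized queue depths.
services = ["nginx", "memcached", "mongodb"]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [0.0, 0.0, 0.0]
culprit, probs = predict_culprit([0.1, 0.9, 0.2], identity, zeros,
                                 identity, zeros, services)
```

With identity weights the service with the deepest queue gets the highest probability. A trained model would learn the weights instead.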

SLIDE 15

DNN Configuration

• Training once: slow (hours to days)
  ◦ Across load levels, load distributions, request types
  ◦ Distributed queue traces, annotated with QoS violations
  ◦ Weight/bias inference with SGD
  ◦ Retraining in the background
• Inference continuously: streaming trace data

93% accuracy in signaling upcoming QoS violations
91% accuracy in attributing QoS violation to correct microservice
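
As a rough illustration of "weight/bias inference with SGD", the sketch below trains a single logistic unit on a few labeled samples. Logistic regression is a stand-in chosen for brevity, not Seer's DNN. The samples are synthetic toy values (a normalized queue depth and a 0/1 violation label), not data from the talk.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Predicted probability of a QoS violation for feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sgd_train(samples: Sequence[Tuple[List[float], int]], lr: float = 0.5,
              epochs: int = 200, seed: int = 0) -> Tuple[List[float], float]:
    """Fit weights and bias with per-sample stochastic gradient descent."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    b = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            grad = predict(w, b, x) - y  # d(log loss)/d(pre-activation)
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


# Synthetic toy samples: (normalized queue depth, violation label).
TOY_SAMPLES = [([0.1], 0), ([0.2], 0), ([0.8], 1), ([0.9], 1)]
w, b = sgd_train(TOY_SAMPLES)
```

Background retraining would amount to rerunning the same update loop on fresh annotated traces.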

SLIDE 16

DNN Configuration

• Challenges:
  ◦ In large clusters, inference is too slow to prevent QoS violations
  ◦ Offload to TPUs: 10-100x improvement; 10ms for 90th %ile inference
  ◦ Fast enough for most corrective actions to take effect (net bw partitioning, RAPL, cache partitioning, scale-up/out, etc.)

Accuracy stable or increasing with cluster size
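
The "fast enough" claim is a latency-budget argument: tail inference latency plus the time a corrective action needs must fit inside the prediction horizon. The sketch below uses a nearest-rank percentile. The function names, sample latencies, and action time are illustrative assumptions, not measurements from the talk.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def can_act_in_time(inference_ms: Sequence[float], action_ms: float,
                    horizon_ms: float, q: int = 90) -> bool:
    """True if q-th percentile inference plus the action fits the horizon."""
    return percentile(inference_ms, q) + action_ms <= horizon_ms


# Illustrative latencies in milliseconds.
latencies = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ok = can_act_in_time(latencies, action_ms=50, horizon_ms=100)
```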

SLIDE 17

Experimental Setup

• 40 dedicated servers
• ~1000 single-concerned containers
• Machine utilization 80-85%
• Inject interference to cause QoS violation
  ◦ Using microbenchmarks (CPU, cache, memory, network, disk I/O)

SLIDE 18

Restoring QoS

• Identify cause of QoS violation
  ◦ Private cluster: performance counters & utilization monitors
  ◦ Public cluster: contentious microbenchmarks
• Adjust resource allocation
  ◦ RAPL (fine-grain DVFS) & scale-up for CPU contention
  ◦ Cache partitioning (CAT) for cache contention
  ◦ Memory capacity partitioning for memory contention
  ◦ Network bandwidth partitioning (HTB) for net contention
  ◦ Storage bandwidth partitioning for I/O contention
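
The cause-to-action pairing on this slide is essentially a lookup table. A minimal sketch, with the `Contention` enum and `actions_for` helper as hypothetical names and the action strings taken from the bullets above:

```python
from enum import Enum
from typing import Dict, List


class Contention(Enum):
    """Hypothetical labels for the contended resource."""
    CPU = "cpu"
    CACHE = "cache"
    MEMORY = "memory"
    NETWORK = "network"
    DISK_IO = "disk_io"


# Corrective actions per contended resource, as listed on the slide.
CORRECTIVE_ACTIONS: Dict[Contention, List[str]] = {
    Contention.CPU: ["RAPL (fine-grain DVFS)", "scale-up"],
    Contention.CACHE: ["cache partitioning (CAT)"],
    Contention.MEMORY: ["memory capacity partitioning"],
    Contention.NETWORK: ["network bandwidth partitioning (HTB)"],
    Contention.DISK_IO: ["storage bandwidth partitioning"],
}


def actions_for(cause: Contention) -> List[str]:
    """Return the candidate corrective actions for a diagnosed cause."""
    return CORRECTIVE_ACTIONS[cause]
```

For example, `actions_for(Contention.NETWORK)` returns the single HTB-based action, while CPU contention offers two options.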

SLIDE 19

Restoring QoS

• Post-detection, baseline system → dropped requests
• Post-detection, Seer → maintain nominal performance

SLIDE 20

Demo

[Figure legend: CPU, Mem, Net, Disk, Queue]

SLIDE 21

SLIDE 22

Challenges Ahead

• Security implications of data-driven approaches
• Fall-back mechanisms when ML goes wrong
• Not a single-layer solution → predictability needs vertical approaches

Thank you!

Serverless microservices IoT swarms