

SLIDE 1

TAILBENCH: A BENCHMARK SUITE AND EVALUATION METHODOLOGY FOR LATENCY-CRITICAL APPLICATIONS

HARSHAD KASTURE, DANIEL SANCHEZ

IISWC 2016 tailbench.csail.mit.edu

SLIDE 2

Executive Summary

- Latency-critical applications have stringent performance requirements → low datacenter utilization
  - Wastes billions of dollars in energy and equipment annually
- Research in this area is hampered by the lack of a comprehensive benchmark suite
  - Few latency-critical applications → limited coverage
  - Complicated setup and configuration
  - Methodological issues → inaccurate latency measurements
- TailBench makes latency-critical applications easy to analyze
  - Varied application domains and latency characteristics
  - Standardized, statistically sound methodology
  - Supports simplified load-testing configurations

SLIDE 3

Outline

- Background and Motivation
- TailBench Applications
- TailBench Harness
- Simplified Configurations

SLIDE 4

Understanding Latency-Critical Applications

[Diagram: clients send requests into a datacenter, where a root node fans them out to leaf nodes, each backed by back-end servers]


SLIDE 7

Understanding Latency-Critical Applications

- The few slowest responses determine user-perceived latency
- Tail latency (e.g., the 95th/99th percentile), not mean latency, determines performance
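The mean-vs-tail distinction can be made concrete with a short sketch. The latency samples below are synthetic and purely illustrative; the `percentile` helper is a simple nearest-rank implementation, not TailBench's code:

```python
import random

def percentile(samples, p):
    """Nearest-rank p-th percentile (p in 0-100) of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-indexed nearest rank
    return ordered[rank - 1]

random.seed(42)
# Synthetic request latencies in ms: exponentially distributed, mean ~1 ms.
latencies = [random.expovariate(1.0) for _ in range(10_000)]

mean = sum(latencies) / len(latencies)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
# The 99th percentile is several times the mean: the tail, not the
# average, is what the slowest responses expose to users.
```

Even with this well-behaved distribution, p99 lands several times above the mean, which is why reporting only averages hides the latencies users actually notice.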

[Diagram: the datacenter fan-out again, annotated with per-leaf response times (1 ms); clients at the bottom]

SLIDE 8

Latency Requirements Cause Low Utilization

- End-to-end latency increases rapidly with load
  - Must keep utilization low to keep latency within reasonable bounds
- Traditional resource management techniques (e.g., colocation) often cannot be used, since they degrade latency
- Low resource utilization wastes billions of dollars in energy and equipment
  - Has sparked research in latency-critical systems
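The load-latency tension can be illustrated with the textbook M/M/1 queueing model, where mean time in system is W = 1/(μ − λ). This is an idealization chosen here for illustration, not a model used in the talk, and the 1000 requests/s capacity is a made-up figure:

```python
def mm1_mean_latency(service_rate, utilization):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    arrival_rate = utilization * service_rate
    return 1.0 / (service_rate - arrival_rate)

# Hypothetical server: 1000 requests/s capacity (1 ms mean service time).
mu = 1000.0
lat_50 = mm1_mean_latency(mu, 0.5)  # 2x the bare service time
lat_90 = mm1_mean_latency(mu, 0.9)  # 10x the bare service time
```

Going from 50% to 90% utilization quintuples mean latency in this model, which is why operators with strict latency targets must leave capacity idle.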

SLIDE 9

Benchmark Suite Design Goals

- Applications from a diverse set of domains
- Applications with diverse tail latency characteristics
- Easy to set up and run
  - Supports different measurement scenarios
- Robust latency measurement methodology

[Icons: application domains (e.g., key-value store, machine translation), request latencies spanning 100 μs to 1 s, and measurement scenarios such as DVFS, LLC warmup, and live VM migration]

SLIDE 10

Outline

- Background and Motivation
- TailBench Applications
- TailBench Harness
- Simplified Configurations

SLIDE 11

TailBench Applications

- xapian: online search
- masstree: key-value store
- sphinx: speech recognition
- img-dnn: image recognition
- specjbb: Java middleware
- silo: in-memory database
- shore: on-disk database
- moses: statistical machine translation (你好 → Hello)

SLIDE 12

Wide Range of End-to-End Latencies

[Chart: end-to-end latencies of silo, specjbb, masstree, shore, xapian, img-dnn, moses, and sphinx, spanning 100 μs to 1 s]

SLIDE 13

Varied Service Time Characteristics

- masstree service times are more tightly distributed
- xapian service times are more loosely distributed

SLIDE 14

End-to-End Latency vs. Load

SLIDE 15

Tail ≠ Mean

- Tail latency increases more rapidly with load than mean latency
- The relationship between mean and tail latencies is hard to predict
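A small single-server queue simulation makes the point. This is a hypothetical FIFO model with Poisson arrivals and exponential service times, not TailBench's harness; all parameters are invented for illustration:

```python
import random

def simulate(load, n=50_000, seed=1):
    """FIFO single-server queue with Poisson arrivals (rate = load) and
    exponential service times (mean 1.0); returns (mean, p99) latency."""
    rng = random.Random(seed)
    t = 0.0          # current arrival time
    free_at = 0.0    # when the server next becomes idle
    lats = []
    for _ in range(n):
        t += rng.expovariate(load)       # inter-arrival gap
        start = max(t, free_at)          # wait behind earlier requests
        free_at = start + rng.expovariate(1.0)
        lats.append(free_at - t)         # queuing delay + service time
    lats.sort()
    return sum(lats) / n, lats[int(0.99 * n)]

mean_lo, p99_lo = simulate(0.5)   # 50% utilization
mean_hi, p99_hi = simulate(0.9)   # 90% utilization
```

Moving from 50% to 90% load, the p99 grows by a far larger absolute amount than the mean, and the exact relationship between the two depends on the service-time distribution, matching the "hard to predict" point above.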

SLIDE 16

Impact of Parallelism

SLIDE 17

Parallelism Helps Some Applications

SLIDE 18

…But Hurts Others

SLIDE 19

Outline

- Background and Motivation
- TailBench Applications
- TailBench Harness
- Simplified Configurations

SLIDE 20

TailBench Harness

- Measuring tail latency accurately is complicated
  - Load generation, statistics aggregation, warmup periods…
- The harness encapsulates most of the complexity
- The harness makes TailBench easily extensible
  - New benchmarks reuse existing harness functionality
- Simplified harness configurations enable different measurement scenarios
  - Trade off some accuracy for reduced setup complexity

SLIDE 21

Example: Open- vs. Closed-Loop Clients

- Many popular load testers use closed-loop clients
  - Clients wait for a response before submitting the next request
  - An increase in application load throttles the client request rate
- Latency-critical applications typically service a large number of independent clients
  - Request rate is independent of application load
  - Better modeled by open-loop clients
- Closed-loop clients can underestimate latency by orders of magnitude [Tene LLS 2013, Zhang ISCA 2016]

[Diagram: clients connected to the application over a network]
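The underestimation (often called coordinated omission) can be shown with a toy deterministic model. Everything here is hypothetical: unit service time, fixed 2-unit open-loop inter-arrival gaps, and a server that freezes for 50 time units partway through the run:

```python
def serve_fifo(arrivals, svc=1.0, stall_start=100.0, stall_end=150.0):
    """FIFO server with fixed service time, frozen during
    [stall_start, stall_end); returns per-request latencies."""
    free_at = 0.0
    lats = []
    for t in arrivals:
        start = max(t, free_at)
        if stall_start <= start < stall_end:
            start = stall_end           # wait out the freeze
        free_at = start + svc
        lats.append(free_at - t)
    return lats

# Open loop: requests arrive every 2 time units regardless of responses.
open_lats = serve_fifo([2.0 * i for i in range(500)])

# Closed loop: one client sends the next request only after the response.
closed_lats = []
t = free_at = 0.0
for _ in range(500):
    start = max(t, free_at)
    if 100.0 <= start < 150.0:
        start = 150.0                   # same server freeze
    free_at = start + 1.0
    closed_lats.append(free_at - t)
    t = free_at                         # zero think time: send immediately

slow_open = sum(l > 10.0 for l in open_lats)
slow_closed = sum(l > 10.0 for l in closed_lats)
```

Dozens of open-loop requests observe the stall (those arriving during it, plus those stuck behind the backlog while it drains), while the closed-loop client simply stops sending and records the stall in exactly one sample, so its latency distribution looks far healthier than what independent users would experience.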

SLIDE 22

Networked Harness Configuration

[Diagram: App Clients, with a Traffic Shaper and Stats Collector, communicate over a TCP/IP network with the Application, which serves requests from a Request Queue]

SLIDE 23

Networked Harness Configuration

- Application and clients run on separate machines
- Traffic Shaper inserts inter-request delays to model load
- Request Queue enqueues incoming requests and measures service times and queuing delays
- Statistics Collector aggregates latency data
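As a sketch of the Traffic Shaper's role: an open-loop client at a target load can draw its inter-request delays from an exponential distribution, giving a Poisson request stream. The exponential choice and the function below are illustrative assumptions; consult the TailBench source for its exact behavior:

```python
import random

def inter_request_delays(target_qps, n, seed=0):
    """Exponentially distributed inter-request gaps with mean
    1/target_qps seconds: an open-loop Poisson stream at target_qps."""
    rng = random.Random(seed)
    return [rng.expovariate(target_qps) for _ in range(n)]

delays = inter_request_delays(target_qps=1000.0, n=100_000)
mean_gap = sum(delays) / len(delays)   # ~0.001 s at 1000 QPS
```

Because the client sleeps for each sampled gap whether or not earlier responses have returned, offered load stays fixed as the server slows down, which is what keeps the measured queuing delays honest.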


SLIDE 27

Networked Harness Configuration

- ✓ Faithfully captures all sources of overhead
- ✗ Difficult to configure and deploy

SLIDE 28

Outline

- Background and Motivation
- TailBench Applications
- TailBench Harness
- Simplified Configurations

SLIDE 29

Loopback Harness Configuration

- Application and clients reside on the same machine
- ✓ Reduced setup complexity
- ✓ Highly accurate in many cases
- ✗ Difficult to simulate

[Diagram: App Clients and the Application on one machine, communicating over the TCP/IP loopback interface]

SLIDE 30

Load-Latency for Networked Configuration

SLIDE 31

Loopback Configuration Highly Accurate

- Loopback and Networked configurations have near-identical performance
  - Networking delays are minimal in our setup

SLIDE 32

Loopback Harness Configuration

- Application and clients reside on the same machine
- ✓ Reduced setup complexity
- ✓ Highly accurate in many cases
- ✗ Still difficult to simulate

SLIDE 33

Integrated Harness Configuration

- Application and client integrated into a single process
- ✓ Easy to set up
- ✗ Some loss of accuracy

[Diagram: App Client and Application combined in a single process]

SLIDE 34

Integrated Configuration Validation

[Chart: load-latency curves; labels of 39% and 23% mark differences in saturation load]

- Networked/Loopback configurations saturate earlier for applications with short requests (silo, specjbb)
  - TCP/IP processing overhead is a significant fraction of request processing

SLIDE 35

Integrated Harness Configuration

- Application and client integrated into a single process
- ✓ Easy to set up
- ✗ Some loss of accuracy
- ✓ Enables user-level simulations

SLIDE 36

Simulation vs. Real System

[Chart: per-application errors, labeled 32%, 16%, 31%, 20%, 16%]

- Performance difference between real and simulated systems is well within usual simulation error bounds
  - Average absolute error in saturation QPS: 14%
  - zsim IPC error for SPEC CPU2006 applications: 8.5–21%

SLIDE 37

Conclusions

- TailBench includes a diverse set of latency-critical applications with varied latency characteristics
- The TailBench harness implements a statistically sound experimental methodology to achieve accurate results
- Various harness configurations allow trading off configuration complexity for some accuracy
- Our results show that the integrated configuration is highly accurate for six of our eight benchmarks

SLIDE 38

THANKS FOR YOUR ATTENTION! QUESTIONS?

tailbench.csail.mit.edu