SLIDE 1

Latency Trumps All

Chris Saari twitter.com/chrissaari blog.chrissaari.com saari@yahoo-inc.com

Thursday, November 19, 2009

SLIDE 2

Packet Latency

Time for a packet to get between points A and B: physical distance + time queued in devices along the way (~60ms)

SLIDE 3

...

SLIDE 4

Anytime...

...the system is waiting for data. The system is end to end:

  • Human response time
  • Network card buffering
  • System bus/interconnect speed
  • Interrupt handling
  • Network stacks
  • Process scheduling delays
  • Application process waiting for data from memory to get to the CPU, or from disk to memory to CPU

  • Routers, modems, last mile speeds
  • Backbone speed and operating condition
  • Inter-cluster/colo performance

SLIDE 5

Big Picture

[Diagram: the end-to-end system - User, CPU, Memory, Disk, Network]

SLIDE 6

Tubes?

SLIDE 7

Latency vs. Bandwidth

[Diagram: bandwidth is bits / second; latency is time]

SLIDE 8

Bandwidth of a Truck Full of Tape

Enormous bandwidth, terrible latency.

SLIDE 9

Latency Lags Bandwidth - David Patterson

[Chart: across technologies, bandwidth has improved far faster than latency]
SLIDE 10

The Problem

Relative Data Access Latencies, Fastest to Slowest

  • CPU Registers (1)
  • L1 Cache (1-2)
  • L2 Cache (6-10)
  • Main memory (25-100)
  • --- don’t cross this line, don’t go off the motherboard! ---
  • Hard drive (1e7)
  • LAN (1e7-1e8)
  • WAN (1e9-2e9)

SLIDE 11

Relative Data Access Latency

[Chart: CPU Register → L1 → L2 → RAM, fast to slow]

SLIDE 12

Relative Data Access Latency

[Chart: CPU Register → L1 → L2 → RAM → Hard Disk, fast to slow]

SLIDE 13

Relative Data Access Latency

[Chart: Register → L1 → L2 → RAM → Hard Disk → Floppy/CD-ROM → LAN → WAN, lower to higher latency]

SLIDE 14

CPU Register

CPU Register Latency = Average Human Height (the base unit for the scale that follows)

SLIDE 15

L1 Cache

SLIDE 16

L2 Cache

x 6 to x 10

SLIDE 17

RAM

x 25 to x 100

SLIDE 18

Hard Drive

x 10M = 0.4 x the equatorial circumference of Earth
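(Check, taking a ~1.7m human as the base unit - my arithmetic, not the talk’s: 1e7 x 1.7m ≈ 17,000 km, about 0.42 of Earth’s ~40,075 km equatorial circumference.)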

SLIDE 19

WAN

x 100M = 0.42 x the Earth-to-Moon distance

SLIDE 20

To experience pain...

Mobile phone network latency 2-10x that of wired

  • iPhone 3G 500ms ping

x 500M = 2 x the Earth-to-Moon distance

SLIDE 21

500ms isn’t that long...

SLIDE 22

Google SPDY

“It is designed specifically for minimizing latency through features such as multiplexed streams, request prioritization and HTTP header compression.”

SLIDE 23

Strategy Pattern: Move Data Up

Relative Data Access Latencies

  • CPU Registers (1)
  • L1 Cache (1-2)
  • L2 Cache (6-10)
  • Main memory (25-100)
  • Hard drive (1e7)
  • LAN (1e7-1e8)
  • WAN (1e9-2e9)

SLIDE 24

Batching: Do it Once

SLIDE 25

Batching: Maximize Data Locality

SLIDE 26

Let’s Dig In

Relative Data Access Latencies, Fastest to Slowest

  • CPU Registers (1)
  • L1 Cache (1-2)
  • L2 Cache (6-10)
  • Main memory (25-100)
  • Hard drive (1e7)
  • LAN (1e7-1e8)
  • WAN (1e9-2e9)

SLIDES 27-30

Network

If you can’t Move Data Up, minimize accesses

Souders performance rules:

1) Make fewer HTTP requests

  • Avoid going halfway to the moon whenever possible

2) Use a content delivery network

  • Edge caching gets data physically closer to the user

3) Add an expires header

  • Instead of going halfway to the moon (Network), climb Godzilla (RAM) or go 40% of the way around the Earth (Disk) instead

SLIDE 31

Network: Packets and Latency

Less data = fewer packets = less packet loss = lower latency

SLIDE 32

Network

1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components

SLIDE 33

Disk: Falling off the Latency Cliff

SLIDE 34

Jim Gray, Microsoft 2006

Tape is Dead
Disk is Tape
Flash is Disk
RAM Locality is King

SLIDE 35

Strategy: Move Up: Disk to RAM

RAM gets you above the exponential latency line

  • Linear cost and power consumption = $$$

Main memory (25-100) vs. hard drive (1e7)

SLIDE 36

Strategy: Avoidance: Bloom Filters

  • Probabilistic answer to whether an element is in a set (sketch below)
  • Constant-time lookup via multiple hashes
  • Constant-space bit array
  • Used in BigTable, Cassandra, Squid
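The point of the avoidance strategy: a few probes of a bit array in RAM can spare a ~1e7-cost disk seek, because a Bloom filter answers “definitely absent” or “maybe present” and never gives a false negative. A minimal sketch in C; the filter size and the XOR-seeded FNV-1a hashes are illustrative choices, not from the talk:

```c
#include <stdint.h>
#include <stdio.h>

#define BLOOM_BITS 1024              /* illustrative; real filters are sized from
                                        expected keys and target false-positive rate */
static uint8_t bloom[BLOOM_BITS / 8];

/* FNV-1a, XOR-seeded so one function yields several independent-ish hashes. */
static uint32_t hash32(const char *key, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    while (*key) { h ^= (uint8_t)*key++; h *= 16777619u; }
    return h % BLOOM_BITS;
}

static void bloom_add(const char *key) {
    uint32_t a = hash32(key, 0), b = hash32(key, 0x9747b28cu);
    bloom[a / 8] |= 1u << (a % 8);
    bloom[b / 8] |= 1u << (b % 8);
}

/* 1 = maybe present (false positives possible); 0 = definitely absent. */
static int bloom_maybe(const char *key) {
    uint32_t a = hash32(key, 0), b = hash32(key, 0x9747b28cu);
    return (bloom[a / 8] >> (a % 8) & 1) && (bloom[b / 8] >> (b % 8) & 1);
}

int main(void) {
    bloom_add("row:42");
    printf("%d %d\n", bloom_maybe("row:42"), bloom_maybe("row:999"));
    return 0;
}
```

Only a “maybe” forces the disk read; a “no” costs a couple of cache lines.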

SLIDE 37

In Memory Indexes

Haystack keeps file system indexes in RAM

  • Cut disk accesses per image from 3 to 1 (sketch below)

Search index compression; the GFS master node uses prefix compression of names
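A minimal sketch of the in-memory index pattern in C (the entry layout and sample values are hypothetical, not Haystack’s format): the lookup is a binary search in RAM, so the only disk access left is the final read of the image bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* In-RAM index: photo id -> (offset, length) in one big store file.
 * Kept sorted by id so lookup never touches disk; the image itself
 * is then fetched with a single positioned read. */
typedef struct { uint64_t id, offset, length; } entry_t;

static const entry_t *index_find(const entry_t *idx, size_t n, uint64_t id) {
    size_t lo = 0, hi = n;
    while (lo < hi) {                       /* binary search, all in memory */
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].id < id) lo = mid + 1;
        else hi = mid;
    }
    return (lo < n && idx[lo].id == id) ? &idx[lo] : NULL;
}

int main(void) {
    entry_t idx[] = { {10, 0, 4096}, {42, 4096, 8192}, {99, 12288, 2048} };
    const entry_t *e = index_find(idx, 3, 42);
    if (e) printf("one disk read: offset=%llu len=%llu\n",
                  (unsigned long long)e->offset, (unsigned long long)e->length);
    return 0;
}
```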

SLIDE 38

Managing Gigabytes - Witten, Moffat, and Bell

SLIDE 39

SSDs

I/O ops / sec

  • Disk: ~70-100, ~180-200 (15K RPM)
  • SSD: ~10K-100K

Seek times

  • Disk: ~7-3.2 ms
  • SSD: ~0.085-0.05 ms

SSDs use less than 1/5th the power of spinning disk

SLIDE 40

Sequential vs. Random Disk Access

  • James Hamilton

SLIDE 41

1TB Sequential Read

SLIDE 42

1TB Random Read

[Calendar: about two weeks elapse before “Done!”]
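Back-of-the-envelope check (my numbers, not the talk’s): at ~200 random reads/second and 4KB per read, 1TB takes 1e12 / 4096 ≈ 2.5e8 seeks, and 2.5e8 / 200 ≈ 1.25e6 seconds ≈ 14.5 days. A sequential read at ~100MB/s finishes the same terabyte in about 3 hours.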

SLIDES 43-44

Strategy: Batching and Streaming

Fewer reads/writes of large contiguous chunks of data (sketch below)

  • GFS 64MB chunks

Requires data locality

  • BigTable: app-specified data layout and compression
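What streaming looks like at the file level, as a minimal C sketch; the 64MB buffer echoes the GFS chunk size, everything else (the file, the processing) is illustrative:

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (64 * 1024 * 1024)   /* 64MB, echoing the GFS chunk size */

/* Stream a file in large contiguous reads: one seek amortized over
 * tens of millions of bytes instead of a seek per small record. */
long long stream(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    char *buf = malloc(CHUNK);
    if (!buf) { fclose(f); return -1; }
    long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, CHUNK, f)) > 0)
        total += (long long)n;     /* process buf[0..n) while it is hot */
    free(buf);
    fclose(f);
    return total;
}
```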

SLIDE 45

The CPU

SLIDE 46

“CPU Bound”

Being “CPU bound” really means: data in RAM, plus CPU access to that data

SLIDE 47

The Memory Wall

SLIDE 48

Latency Lags Bandwidth

  • Dave Patterson

SLIDE 49

Multicore Makes It Worse!

More cores accelerate the rate of divergence

  • CPU performance doubled 3x over the past 5 years
  • Memory performance doubled once

SLIDE 50

Evolving CPU Memory Access Designs

Intel Nehalem: integrated memory controller and new high-speed interconnect; 40 percent shorter latency and increased bandwidth, for a 4-6x faster system

SLIDE 51

More CPU evolution

Intel Nehalem-EX

  • 8 cores, 24MB of cache, 2 integrated memory controllers
  • Ring interconnect: an on-die network designed to speed the movement of data among the caches used by each of the cores

IBM Power 7

  • 32MB Level 3 cache

AMD Magny-Cours

  • 12 cores, 12MB of Level 3 cache

SLIDE 52

Cache Hit Ratio

SLIDES 53-57

Cache Line Awareness

Linked list

  • Each node as a separate allocation is Bad

Hash table

  • Reprobe on collision with stride of 1 (see the sketch after this list)

Stack allocation

  • Top of stack is usually in cache; top of the heap is usually not

Pipeline processing

  • For the stages of operations on a piece of data, do them all at once vs. each stage separately

Optimize for size

  • Might execute faster than code optimized for speed
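The stride-1 reprobe in a minimal open-addressing sketch in C (table size and hash constant are illustrative): colliding keys land in adjacent slots, so reprobes usually hit a cache line that is already loaded.

```c
#include <stdint.h>

#define SLOTS 1024   /* power of two, illustrative */

typedef struct { uint32_t key, val; int used; } slot_t;
static slot_t table[SLOTS];

static uint32_t hash_u32(uint32_t k) { return (k * 2654435761u) & (SLOTS - 1); }

/* Linear probing: on collision, step to the next slot (stride 1),
 * which typically sits in the cache line we just fetched. */
int put(uint32_t key, uint32_t val) {
    for (uint32_t i = hash_u32(key), n = 0; n < SLOTS;
         i = (i + 1) & (SLOTS - 1), n++) {
        if (!table[i].used || table[i].key == key) {
            table[i] = (slot_t){ key, val, 1 };
            return 1;
        }
    }
    return 0;   /* table full */
}

int get(uint32_t key, uint32_t *val) {
    for (uint32_t i = hash_u32(key), n = 0; n < SLOTS && table[i].used;
         i = (i + 1) & (SLOTS - 1), n++) {
        if (table[i].key == key) { *val = table[i].val; return 1; }
    }
    return 0;
}

int main(void) {
    uint32_t v;
    put(1, 10);
    put(1025, 20);               /* 1 and 1025 collide after masking */
    return get(1025, &v) ? 0 : 1;
}
```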

SLIDE 58

Cycles to Burn

1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components

  • Use excess compute for compression
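Spending cycles to save packets, sketched with zlib’s one-shot compress() (the message and buffer sizes are illustrative; link with -lz):

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    const char *msg = "HTML, CSS, and JS bodies are highly compressible text.";
    uLong src_len = (uLong)strlen(msg) + 1;

    /* compressBound() gives the worst-case compressed size. */
    uLongf dst_len = compressBound(src_len);
    Bytef dst[256];

    /* Burn spare CPU here to put fewer packets on the wire. */
    if (compress(dst, &dst_len, (const Bytef *)msg, src_len) != Z_OK)
        return 1;
    printf("%lu bytes -> %lu bytes\n", src_len, dst_len);
    return 0;
}
```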

SLIDE 59

Datacenter

SLIDE 60

Datacenter Storage Hierarchy

  • Jeff Dean, Google

SLIDE 61

Intra-Datacenter Round Trip

x 500,000 = ~500 miles, roughly NYC to Columbus, OH

SLIDE 62

Datacenter Level Systems

Facebook Cassandra, Google BigTable, memcached, Redis, Project Voldemort, Yahoo Sherpa, Sawzall / Pig, Google File System, RethinkDB, MonetDB, HBase, Facebook Haystack

SLIDES 63-67

Memcached Facebook Optimizations

  • UDP to reduce network traffic - Fewer Packets
  • One core saturated with network interrupt handling: opportunistic polling of the network interfaces and setting interrupt coalescing thresholds aggressively - Batching
  • Contention on the network device transmit queue lock; packets were added/removed from the queue one at a time
  • Changed the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets (see the sketch below)
  • More lock contention fixes
  • Result: 200,000 UDP requests/second with an average latency of 173 microseconds
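The batched-dequeue idea as a minimal pthreads sketch (queue layout, batch size, and transmit() are illustrative stand-ins, not Facebook’s code): take the lock once, drain up to a batch of packets, release, then do the slow transmit outside the lock.

```c
#include <pthread.h>
#include <stddef.h>

#define BATCH 32

typedef struct pkt { struct pkt *next; /* payload... */ } pkt_t;

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pkt_t *q_head;                      /* singly linked transmit queue */

static void transmit(pkt_t *p) { (void)p; /* stand-in for slow device I/O */ }

void tx_worker(void) {
    pkt_t *batch[BATCH];
    for (;;) {                  /* a real worker would block when empty */
        size_t n = 0;
        /* One lock acquisition amortized over up to BATCH dequeues,
         * instead of a lock/unlock cycle per packet. */
        pthread_mutex_lock(&q_lock);
        while (q_head && n < BATCH) {
            batch[n++] = q_head;
            q_head = q_head->next;
        }
        pthread_mutex_unlock(&q_lock);

        /* Transmit outside the lock so other cores can enqueue meanwhile. */
        for (size_t i = 0; i < n; i++)
            transmit(batch[i]);
    }
}
```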

SLIDES 68-73

Google BigTable

Table contains a sequence of blocks

  • Block index loaded into memory - Move Up

Table can be completely mapped into memory - Move Up

Bloom filters hint whether data is present - Move Up

Locality groups loaded in memory - Move Up, Batching

  • Clients can control compression of locality groups

2 levels of caching - Move Up

  • Scan cache of key/value pairs, plus a block cache

Clients cache tablet server locations

  • 3 to 6 network trips if the cache is invalid - Move Up

SLIDES 74-78

Facebook Cassandra

Bloom filters used for keys in files on disk - Move Up

Sequential disk access only - Batching

  • Append w/o read ahead

Log to memory and write to a commit log on a dedicated disk - Batching (see the sketch below)

Programmer-controlled data layout for locality - Batching

Result: 2 orders of magnitude better performance than MySQL
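The write path as a minimal C sketch (the file name, record format, and memtable_put are hypothetical, not Cassandra’s code): every durable write is a sequential append, so the dedicated log disk never seeks.

```c
#include <stdio.h>

/* Durability via sequential appends: the commit log only ever
 * moves forward, so the dedicated log disk never seeks. */
static FILE *commit_log;

int write_mutation(const char *key, const char *value) {
    /* 1. Append to the commit log (sequential, batching-friendly I/O). */
    if (fprintf(commit_log, "%s=%s\n", key, value) < 0) return -1;
    if (fflush(commit_log) != 0) return -1;  /* real systems group-commit + fsync */

    /* 2. Apply to the in-memory table; reads are served from RAM until
     *    the memtable is flushed to disk in one sequential write. */
    /* memtable_put(key, value);  -- hypothetical in-memory structure */
    return 0;
}

int main(void) {
    commit_log = fopen("commit.log", "ab");
    if (!commit_log) return 1;
    write_mutation("user:1", "saari");
    fclose(commit_log);
    return 0;
}
```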

SLIDE 79

Move the Compute to the Data: YQL Execute

SLIDES 80-83

From the Browser Perspective

Performance is bounded by 3 things:

  • Fetch time
    • Unless you’re bundling everything, it is a cascade of interdependent requests, at least 2 phases’ worth
  • Parse time
    • HTML
    • CSS
    • Javascript
  • Execution time
    • Javascript execution
    • DOM construction and layout
    • Style application

SLIDE 84

Recap

Move Data Up

  • Caching
  • Compression

If You Can’t Move All The Data Up

  • Indexes
  • Bloom filters

Batching and Streaming

  • Maximize locality

SLIDE 85

Take 2 And Call Me In The Morning

An Engineer’s Guide to Bandwidth

  • http://developer.yahoo.net/blog/archives/2009/10/a_engineers_gui.html

High Performance Web Sites

  • Steve Souders

Even Faster Web Sites

  • Steve Souders

Managing Gigabytes: Compressing and Indexing Documents and Images

  • Witten, Moffat, Bell

Yahoo Query Language (YQL)

  • http://developer.yahoo.com/yql/
