SLIDE 1

Latency Trumps All

Chris Saari twitter.com/chrissaari blog.chrissaari.com saari@yahoo-inc.com

Thursday, November 19, 2009

SLIDE 2

Packet Latency

Time for a packet to get between points A and B: physical distance + time queued in devices along the way (~60ms)

SLIDE 3

...

SLIDE 4

Anytime...

...the system is waiting for data. The system is end to end:

  • Human response time
  • Network card buffering
  • System bus/interconnect speed
  • Interrupt handling
  • Network stacks
  • Process scheduling delays
  • Application process waiting for data from memory to get to the CPU, or from disk to memory to CPU

  • Routers, modems, last mile speeds
  • Backbone speed and operating condition
  • Inter-cluster/colo performance

SLIDE 5

Big Picture

[Diagram: the end-to-end system - User, CPU, Memory, Disk, Network]

SLIDE 6

Tubes?

SLIDE 7

Latency vs. Bandwidth

[Diagram: bandwidth is bits / second; latency is time]

SLIDE 8

Bandwidth of a Truck Full of Tape

Enormous bandwidth, terrible latency.

SLIDE 9

Latency Lags Bandwidth - David Patterson

[Chart: across technologies, bandwidth has improved far faster than latency]
SLIDE 10

The Problem

Relative Data Access Latencies, Fastest to Slowest

  • CPU Registers (1)
  • L1 Cache (1-2)
  • L2 Cache (6-10)
  • Main memory (25-100)
  • --- don’t cross this line, don’t go off the motherboard! ---
  • Hard drive (1e7)
  • LAN (1e7-1e8)
  • WAN (1e9-2e9)

SLIDE 11

Relative Data Access Latency

[Chart: CPU Register → L1 → L2 → RAM, fast to slow]

SLIDE 12

Relative Data Access Latency

[Chart: CPU Register → L1 → L2 → RAM → Hard Disk, fast to slow]

SLIDE 13

Relative Data Access Latency

[Chart: Register → L1 → L2 → RAM → Hard Disk → Floppy/CD-ROM → LAN → WAN, lower to higher latency]

SLIDE 14

CPU Register

CPU Register Latency = Average Human Height (the base unit for the scale that follows)

SLIDE 15

L1 Cache

SLIDE 16

L2 Cache

x 6 to x 10

SLIDE 17

RAM

x 25 to x 100

SLIDE 18

Hard Drive

x 10M = 0.4 x the equatorial circumference of Earth
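(Check, taking a ~1.7m human as the base unit - my arithmetic, not the talk’s: 1e7 x 1.7m ≈ 17,000 km, about 0.42 of Earth’s ~40,075 km equatorial circumference.)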

SLIDE 19

WAN

x 100M = 0.42 x the Earth-to-Moon distance

SLIDE 20

To experience pain...

Mobile phone network latency 2-10x that of wired

  • iPhone 3G 500ms ping

x 500M = 2 x the Earth-to-Moon distance

SLIDE 21

500ms isn’t that long...

SLIDE 22

Google SPDY

“It is designed specifically for minimizing latency through features such as multiplexed streams, request prioritization and HTTP header compression.”

SLIDE 23

Strategy Pattern: Move Data Up

Relative Data Access Latencies

  • CPU Registers (1)
  • L1 Cache (1-2)
  • L2 Cache (6-10)
  • Main memory (25-100)
  • Hard drive (1e7)
  • LAN (1e7-1e8)
  • WAN (1e9-2e9)

SLIDE 24

Batching: Do it Once

SLIDE 25

Batching: Maximize Data Locality

SLIDE 26

Let’s Dig In

Relative Data Access Latencies, Fastest to Slowest

  • CPU Registers (1)
  • L1 Cache (1-2)
  • L2 Cache (6-10)
  • Main memory (25-100)
  • Hard drive (1e7)
  • LAN (1e7-1e8)
  • WAN (1e9-2e9)

SLIDES 27-30

Network

If you can’t Move Data Up, minimize accesses

Souders performance rules:

1) Make fewer HTTP requests

  • Avoid going halfway to the moon whenever possible

2) Use a content delivery network

  • Edge caching gets data physically closer to the user

3) Add an expires header

  • Instead of going halfway to the moon (Network), climb Godzilla (RAM) or go 40% of the way around the Earth (Disk) instead

SLIDE 31

Network: Packets and Latency

Less data = fewer packets = less packet loss = lower latency

SLIDE 32

Network

1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components

SLIDE 33

Disk: Falling off the Latency Cliff

SLIDE 34

Jim Gray, Microsoft 2006

Tape is Dead
Disk is Tape
Flash is Disk
RAM Locality is King

SLIDE 35

Strategy: Move Up: Disk to RAM

RAM gets you above the exponential latency line

  • Linear cost and power consumption = $$$

Main memory (25-100) vs. hard drive (1e7)

SLIDE 36

Strategy: Avoidance: Bloom Filters

  • Probabilistic answer to whether an element is in a set (sketch below)
  • Constant-time lookup via multiple hashes
  • Constant-space bit array
  • Used in BigTable, Cassandra, Squid
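The point of the avoidance strategy: a few probes of a bit array in RAM can spare a ~1e7-cost disk seek, because a Bloom filter answers “definitely absent” or “maybe present” and never gives a false negative. A minimal sketch in C; the filter size and the XOR-seeded FNV-1a hashes are illustrative choices, not from the talk:

```c
#include <stdint.h>
#include <stdio.h>

#define BLOOM_BITS 1024              /* illustrative; real filters are sized from
                                        expected keys and target false-positive rate */
static uint8_t bloom[BLOOM_BITS / 8];

/* FNV-1a, XOR-seeded so one function yields several independent-ish hashes. */
static uint32_t hash32(const char *key, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    while (*key) { h ^= (uint8_t)*key++; h *= 16777619u; }
    return h % BLOOM_BITS;
}

static void bloom_add(const char *key) {
    uint32_t a = hash32(key, 0), b = hash32(key, 0x9747b28cu);
    bloom[a / 8] |= 1u << (a % 8);
    bloom[b / 8] |= 1u << (b % 8);
}

/* 1 = maybe present (false positives possible); 0 = definitely absent. */
static int bloom_maybe(const char *key) {
    uint32_t a = hash32(key, 0), b = hash32(key, 0x9747b28cu);
    return (bloom[a / 8] >> (a % 8) & 1) && (bloom[b / 8] >> (b % 8) & 1);
}

int main(void) {
    bloom_add("row:42");
    printf("%d %d\n", bloom_maybe("row:42"), bloom_maybe("row:999"));
    return 0;
}
```

Only a “maybe” forces the disk read; a “no” costs a couple of cache lines.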

SLIDE 37

In Memory Indexes

Haystack keeps file system indexes in RAM

  • Cut disk accesses per image from 3 to 1 (sketch below)

Search index compression; the GFS master node uses prefix compression of names
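A minimal sketch of the in-memory index pattern in C (the entry layout and sample values are hypothetical, not Haystack’s format): the lookup is a binary search in RAM, so the only disk access left is the final read of the image bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* In-RAM index: photo id -> (offset, length) in one big store file.
 * Kept sorted by id so lookup never touches disk; the image itself
 * is then fetched with a single positioned read. */
typedef struct { uint64_t id, offset, length; } entry_t;

static const entry_t *index_find(const entry_t *idx, size_t n, uint64_t id) {
    size_t lo = 0, hi = n;
    while (lo < hi) {                       /* binary search, all in memory */
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].id < id) lo = mid + 1;
        else hi = mid;
    }
    return (lo < n && idx[lo].id == id) ? &idx[lo] : NULL;
}

int main(void) {
    entry_t idx[] = { {10, 0, 4096}, {42, 4096, 8192}, {99, 12288, 2048} };
    const entry_t *e = index_find(idx, 3, 42);
    if (e) printf("one disk read: offset=%llu len=%llu\n",
                  (unsigned long long)e->offset, (unsigned long long)e->length);
    return 0;
}
```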

SLIDE 38

Managing Gigabytes - Witten, Moffat, and Bell

SLIDE 39

SSDs

I/O ops / sec

  • Disk: ~70-100, ~180-200 (15K RPM)
  • SSD: ~10K-100K

Seek times

  • Disk: ~7-3.2 ms
  • SSD: ~0.085-0.05 ms

SSDs use less than 1/5th the power of spinning disk

SLIDE 40

Sequential vs. Random Disk Access

  • James Hamilton

SLIDE 41

1TB Sequential Read

SLIDE 42

1TB Random Read

[Calendar: about two weeks elapse before “Done!”]
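Back-of-the-envelope check (my numbers, not the talk’s): at ~200 random reads/second and 4KB per read, 1TB takes 1e12 / 4096 ≈ 2.5e8 seeks, and 2.5e8 / 200 ≈ 1.25e6 seconds ≈ 14.5 days. A sequential read at ~100MB/s finishes the same terabyte in about 3 hours.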

SLIDES 43-44

Strategy: Batching and Streaming

Fewer reads/writes of large contiguous chunks of data (sketch below)

  • GFS 64MB chunks

Requires data locality

  • BigTable: app-specified data layout and compression
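What streaming looks like at the file level, as a minimal C sketch; the 64MB buffer echoes the GFS chunk size, everything else (the file, the processing) is illustrative:

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (64 * 1024 * 1024)   /* 64MB, echoing the GFS chunk size */

/* Stream a file in large contiguous reads: one seek amortized over
 * tens of millions of bytes instead of a seek per small record. */
long long stream(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    char *buf = malloc(CHUNK);
    if (!buf) { fclose(f); return -1; }
    long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, CHUNK, f)) > 0)
        total += (long long)n;     /* process buf[0..n) while it is hot */
    free(buf);
    fclose(f);
    return total;
}
```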

SLIDE 45

The CPU

SLIDE 46

“CPU Bound”

Being “CPU bound” really means: data in RAM, plus CPU access to that data

SLIDE 47

The Memory Wall

SLIDE 48

Latency Lags Bandwidth

  • Dave Patterson

SLIDE 49

Multicore Makes It Worse!

More cores accelerate the rate of divergence

  • CPU performance doubled 3x over the past 5 years
  • Memory performance doubled once

SLIDE 50

Evolving CPU Memory Access Designs

Intel Nehalem: integrated memory controller and new high-speed interconnect; 40 percent shorter latency and increased bandwidth, for a 4-6x faster system

SLIDE 51

More CPU evolution

Intel Nehalem-EX

  • 8 cores, 24MB of cache, 2 integrated memory controllers
  • Ring interconnect: an on-die network designed to speed the movement of data among the caches used by each of the cores

IBM Power 7

  • 32MB Level 3 cache

AMD Magny-Cours

  • 12 cores, 12MB of Level 3 cache

SLIDE 52

Cache Hit Ratio

SLIDES 53-57

Cache Line Awareness

Linked list

  • Each node as a separate allocation is Bad

Hash table

  • Reprobe on collision with stride of 1 (see the sketch after this list)

Stack allocation

  • Top of stack is usually in cache; top of the heap is usually not

Pipeline processing

  • For the stages of operations on a piece of data, do them all at once vs. each stage separately

Optimize for size

  • Might execute faster than code optimized for speed
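The stride-1 reprobe in a minimal open-addressing sketch in C (table size and hash constant are illustrative): colliding keys land in adjacent slots, so reprobes usually hit a cache line that is already loaded.

```c
#include <stdint.h>

#define SLOTS 1024   /* power of two, illustrative */

typedef struct { uint32_t key, val; int used; } slot_t;
static slot_t table[SLOTS];

static uint32_t hash_u32(uint32_t k) { return (k * 2654435761u) & (SLOTS - 1); }

/* Linear probing: on collision, step to the next slot (stride 1),
 * which typically sits in the cache line we just fetched. */
int put(uint32_t key, uint32_t val) {
    for (uint32_t i = hash_u32(key), n = 0; n < SLOTS;
         i = (i + 1) & (SLOTS - 1), n++) {
        if (!table[i].used || table[i].key == key) {
            table[i] = (slot_t){ key, val, 1 };
            return 1;
        }
    }
    return 0;   /* table full */
}

int get(uint32_t key, uint32_t *val) {
    for (uint32_t i = hash_u32(key), n = 0; n < SLOTS && table[i].used;
         i = (i + 1) & (SLOTS - 1), n++) {
        if (table[i].key == key) { *val = table[i].val; return 1; }
    }
    return 0;
}

int main(void) {
    uint32_t v;
    put(1, 10);
    put(1025, 20);               /* 1 and 1025 collide after masking */
    return get(1025, &v) ? 0 : 1;
}
```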

SLIDE 58

Cycles to Burn

1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components

  • Use excess compute for compression
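Spending cycles to save packets, sketched with zlib’s one-shot compress() (the message and buffer sizes are illustrative; link with -lz):

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    const char *msg = "HTML, CSS, and JS bodies are highly compressible text.";
    uLong src_len = (uLong)strlen(msg) + 1;

    /* compressBound() gives the worst-case compressed size. */
    uLongf dst_len = compressBound(src_len);
    Bytef dst[256];

    /* Burn spare CPU here to put fewer packets on the wire. */
    if (compress(dst, &dst_len, (const Bytef *)msg, src_len) != Z_OK)
        return 1;
    printf("%lu bytes -> %lu bytes\n", src_len, dst_len);
    return 0;
}
```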

SLIDE 59

Datacenter

SLIDE 60

Datacenter Storage Hierarchy

  • Jeff Dean, Google

SLIDE 61

Intra-Datacenter Round Trip

x 500,000 = ~500 miles, roughly NYC to Columbus, OH

SLIDE 62

Datacenter Level Systems

Facebook Cassandra, Google BigTable, memcached, Redis, Project Voldemort, Yahoo Sherpa, Sawzall / Pig, Google File System, RethinkDB, MonetDB, HBase, Facebook Haystack

SLIDES 63-67

Memcached Facebook Optimizations

  • UDP to reduce network traffic - Fewer Packets
  • One core saturated with network interrupt handling: opportunistic polling of the network interfaces and setting interrupt coalescing thresholds aggressively - Batching
  • Contention on the network device transmit queue lock; packets were added/removed from the queue one at a time
  • Changed the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets (see the sketch below)
  • More lock contention fixes
  • Result: 200,000 UDP requests/second with an average latency of 173 microseconds
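The batched-dequeue idea as a minimal pthreads sketch (queue layout, batch size, and transmit() are illustrative stand-ins, not Facebook’s code): take the lock once, drain up to a batch of packets, release, then do the slow transmit outside the lock.

```c
#include <pthread.h>
#include <stddef.h>

#define BATCH 32

typedef struct pkt { struct pkt *next; /* payload... */ } pkt_t;

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pkt_t *q_head;                      /* singly linked transmit queue */

static void transmit(pkt_t *p) { (void)p; /* stand-in for slow device I/O */ }

void tx_worker(void) {
    pkt_t *batch[BATCH];
    for (;;) {                  /* a real worker would block when empty */
        size_t n = 0;
        /* One lock acquisition amortized over up to BATCH dequeues,
         * instead of a lock/unlock cycle per packet. */
        pthread_mutex_lock(&q_lock);
        while (q_head && n < BATCH) {
            batch[n++] = q_head;
            q_head = q_head->next;
        }
        pthread_mutex_unlock(&q_lock);

        /* Transmit outside the lock so other cores can enqueue meanwhile. */
        for (size_t i = 0; i < n; i++)
            transmit(batch[i]);
    }
}
```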

SLIDES 68-73

Google BigTable

Table contains a sequence of blocks

  • Block index loaded into memory - Move Up

Table can be completely mapped into memory - Move Up

Bloom filters hint whether data is present - Move Up

Locality groups loaded in memory - Move Up, Batching

  • Clients can control compression of locality groups

2 levels of caching - Move Up

  • Scan cache of key/value pairs, plus a block cache

Clients cache tablet server locations

  • 3 to 6 network trips if the cache is invalid - Move Up

SLIDES 74-78

Facebook Cassandra

Bloom filters used for keys in files on disk - Move Up

Sequential disk access only - Batching

  • Append w/o read ahead

Log to memory and write to a commit log on a dedicated disk - Batching (see the sketch below)

Programmer-controlled data layout for locality - Batching

Result: 2 orders of magnitude better performance than MySQL
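The write path as a minimal C sketch (the file name, record format, and memtable_put are hypothetical, not Cassandra’s code): every durable write is a sequential append, so the dedicated log disk never seeks.

```c
#include <stdio.h>

/* Durability via sequential appends: the commit log only ever
 * moves forward, so the dedicated log disk never seeks. */
static FILE *commit_log;

int write_mutation(const char *key, const char *value) {
    /* 1. Append to the commit log (sequential, batching-friendly I/O). */
    if (fprintf(commit_log, "%s=%s\n", key, value) < 0) return -1;
    if (fflush(commit_log) != 0) return -1;  /* real systems group-commit + fsync */

    /* 2. Apply to the in-memory table; reads are served from RAM until
     *    the memtable is flushed to disk in one sequential write. */
    /* memtable_put(key, value);  -- hypothetical in-memory structure */
    return 0;
}

int main(void) {
    commit_log = fopen("commit.log", "ab");
    if (!commit_log) return 1;
    write_mutation("user:1", "saari");
    fclose(commit_log);
    return 0;
}
```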

SLIDE 79

Move the Compute to the Data: YQL Execute

SLIDES 80-83

From the Browser Perspective

Performance is bounded by 3 things:

  • Fetch time
    • Unless you’re bundling everything, it is a cascade of interdependent requests, at least 2 phases’ worth
  • Parse time
    • HTML
    • CSS
    • Javascript
  • Execution time
    • Javascript execution
    • DOM construction and layout
    • Style application

SLIDE 84

Recap

Move Data Up

  • Caching
  • Compression

If You Can’t Move All The Data Up

  • Indexes
  • Bloom filters

Batching and Streaming

  • Maximize locality

SLIDE 85

Take 2 And Call Me In The Morning

An Engineer’s Guide to Bandwidth

  • http://developer.yahoo.net/blog/archives/2009/10/a_engineers_gui.html

High Performance Web Sites

  • Steve Souders

Even Faster Web Sites

  • Steve Souders

Managing Gigabytes: Compressing and Indexing Documents and Images

  • Witten, Moffat, Bell

Yahoo Query Language (YQL)

  • http://developer.yahoo.com/yql/
