SLIDE 1

NOT EXACTLY! APPROXIMATE ALGORITHMS FOR BIG DATA

FANGJIN YANG · DRUID COMMITTER · METAMARKETS
NELSON RAY · QUANTITATIVE ANALYST · GOOGLE

SLIDE 2

OVERVIEW

THE PROBLEM: MANAGE DATA COST EFFICIENTLY
THE DATA: DEALING WITH EVENT STREAMS
SIMPLIFYING STORAGE: DATA SUMMARIZATION
FINDING UNIQUES: HYPERLOGLOG
ESTIMATING DISTRIBUTION: APPROXIMATE HISTOGRAMS

SLIDE 3

THE PROBLEM

SLIDE 4

Fangjin Yang & Nelson Ray 2014

Real-time Bidding

SLIDE 5

PROBLEMS

Storing/processing billions of rows is expensive
Reduce storage, improve performance
Reduce storage by throwing away information
Throwing away information reduces accuracy

SLIDE 6

THE DATA

SLIDE 7

THE DATA

Timestamp             Bid Price
2013-10-28T02:13:43Z  1.19
2013-10-28T02:14:21Z  0.05
2013-10-28T02:55:32Z  1.04
2013-10-28T03:07:28Z  0.16
2013-10-28T03:13:43Z  1.03
2013-10-28T04:18:19Z  0.15
2013-10-28T05:36:34Z  0.01
2013-10-28T05:37:59Z  1.03

SLIDE 8

DATA SUMMARIZATION

Timestamp             Bid Price
2013-10-28T02:13:43Z  1.19
2013-10-28T02:14:21Z  0.05
2013-10-28T02:55:32Z  1.04
2013-10-28T03:07:28Z  0.16
2013-10-28T03:13:43Z  1.03
2013-10-28T04:18:19Z  0.15
2013-10-28T05:36:34Z  0.01
2013-10-28T05:37:59Z  1.03

Timestamp      Revenue  Number of Prices
2013-10-28T02  2.28     3
2013-10-28T03  1.19     2
2013-10-28T04  0.15     1
2013-10-28T05  1.04     2

SLIDE 9

COMBINING SUMMARIZATIONS

Timestamp      Revenue  Number of Prices
2013-10-28T02  2.28     3
2013-10-28T03  1.19     2
2013-10-28T04  0.15     1
2013-10-28T05  1.04     2

Timestamp   Revenue  Number of Prices
2013-10-28  4.66     8
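
The roll-up shown in these tables can be sketched in a few lines of Python. This is only an illustration, not Druid's implementation: the helper name `rollup` and the idea of truncating the timestamp string are assumptions made here. The point it shows is that the daily row is computed from the hourly summaries alone, without going back to the raw events.

```python
from collections import defaultdict

# Raw events copied from the slides: (timestamp, bid price).
EVENTS = [
    ("2013-10-28T02:13:43Z", 1.19),
    ("2013-10-28T02:14:21Z", 0.05),
    ("2013-10-28T02:55:32Z", 1.04),
    ("2013-10-28T03:07:28Z", 0.16),
    ("2013-10-28T03:13:43Z", 1.03),
    ("2013-10-28T04:18:19Z", 0.15),
    ("2013-10-28T05:36:34Z", 0.01),
    ("2013-10-28T05:37:59Z", 1.03),
]


def rollup(rows, key_len):
    """Group (timestamp, revenue, count) rows by a timestamp prefix.

    Only the running sum and the running count are kept per group.
    `rollup` is a hypothetical helper written for this sketch.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for ts, revenue, count in rows:
        group = totals[ts[:key_len]]
        group[0] += revenue
        group[1] += count
    return {k: (round(v[0], 2), v[1]) for k, v in sorted(totals.items())}


# A raw event is a summary of itself: revenue = price, count = 1.
hourly = rollup([(ts, price, 1) for ts, price in EVENTS], key_len=13)

# The daily summary is built from the hourly summaries only.
daily = rollup([(ts, rev, n) for ts, (rev, n) in hourly.items()], key_len=10)

for ts, (revenue, count) in hourly.items():
    print(ts, revenue, count)
print(daily)
```

Sums and counts combine this way because addition is associative; the case studies later in the deck deal with metrics where that is not true.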

SLIDE 10

SLIDE 11

SUMMARIZATION SUMMARY

Throw away information about individual events
Drastically reduce storage and improve query speed
On average, 40x reduction in storage with our own data
We’ve lost info about individual prices
Data summarization is not always trivial

SLIDE 12

CASE STUDY 1

SLIDE 13

CASE STUDY 1

Problem: determine the number of unique elements in a set
Use case: measuring the number of unique users

DATA BIG DATA

SLIDE 14

EXACT SOLUTION

Store every single username (in a Java HashSet)
No loss of information, no accuracy tradeoff

SLIDE 15

HASHSET

Timestamp             Username
2013-10-28T02:13:43Z  user1
2013-10-28T02:14:21Z  user2
2013-10-28T02:55:32Z  user1
2013-10-28T03:07:28Z  user4
2013-10-28T03:13:43Z  user97
2013-10-28T04:18:19Z  user2
2013-10-28T05:36:34Z  user9834
2013-10-28T05:37:59Z  user97

Timestamp      Usernames
2013-10-28T02  {user1, user2}
2013-10-28T03  {user4, user97}
2013-10-28T04  {user2}
2013-10-28T05  {user9834, user97}

SLIDE 16

HASHSET

Timestamp      Usernames
2013-10-28T02  {user1, user2}
2013-10-28T03  {user4, user97}
2013-10-28T04  {user2}
2013-10-28T05  {user9834, user97}

Timestamp   Usernames
2013-10-28  {user1, user2, user4, user97, user9834}
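
As a rough illustration of the exact approach, the sketch below uses Python sets in place of the Java HashSet named on the slide. It builds one set per hour and then takes their union for the day. The variable names are made up for this example.

```python
# Events copied from the slides: (timestamp, username).
EVENTS = [
    ("2013-10-28T02:13:43Z", "user1"),
    ("2013-10-28T02:14:21Z", "user2"),
    ("2013-10-28T02:55:32Z", "user1"),
    ("2013-10-28T03:07:28Z", "user4"),
    ("2013-10-28T03:13:43Z", "user97"),
    ("2013-10-28T04:18:19Z", "user2"),
    ("2013-10-28T05:36:34Z", "user9834"),
    ("2013-10-28T05:37:59Z", "user97"),
]

# One set of usernames per hour (timestamp truncated to "YYYY-MM-DDTHH").
hourly = {}
for ts, user in EVENTS:
    hourly.setdefault(ts[:13], set()).add(user)

# Combining hours is a set union; duplicates across hours collapse.
daily = set().union(*hourly.values())

print(len(daily), sorted(daily))
```

Note that the hourly unique counts (2 + 2 + 1 + 2 = 7) cannot simply be added to get the daily count (5), which is why the sets themselves have to be stored.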

SLIDE 17

EXACT SOLUTION

Storage/Computation: O(# uniques)
We’re not throwing away any information about usernames
Accuracy: 100%

SLIDE 18

INFEASIBLE STORAGE

High cardinality user dimensions == infeasible storage
Storage cost for 10^9 unique elements == ~48GB of storage

SLIDE 19

CARDINALITY ESTIMATION

Plenty of literature
Linear Counting
Count-Min Sketch
LogLog

SLIDE 20

HYPERLOGLOG

Storage: 1.5 KB (for cardinalities 10^9 and above)
99.999997% decrease in storage size
Computation: O(1) (for cardinalities < ~10^10)
Accuracy: 97%
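
The storage figures quoted for the exact and approximate solutions can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes); the slides do not say which convention was used.

```python
# Figures taken from the slides; decimal units are an assumption.
EXACT_BYTES = 48e9     # ~48 GB to store 10^9 unique elements exactly
HLL_BYTES = 1.5e3      # 1.5 KB for one HyperLogLog object
UNIQUES = 10**9

# Implied cost per stored element in the exact solution.
bytes_per_element = EXACT_BYTES / UNIQUES

# Relative saving from replacing the exact set with the sketch.
reduction_pct = (1 - HLL_BYTES / EXACT_BYTES) * 100

print(f"{bytes_per_element:.0f} bytes per element")
print(f"{reduction_pct:.6f}% decrease in storage")
```

Under these assumptions the exact set costs about 48 bytes per element, and the reduction matches the 99.999997% stated on the slide.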

SLIDE 21

HASH FUNCTIONS

Maps a value in one space (generally larger) to a value in another space (generally smaller)

String -> HashFn -> 0001

SLIDE 22

WHAT MAKES A GOOD HASH FUNCTION?

Bits of the output value are independent, and each bit is equally likely to be 0 or 1 (50%)

String -> HashFn -> 0xxx (50% probability)
String -> HashFn -> 1xxx (50% probability)

SLIDE 23

HASHING TWO STRINGS

user1 -> HashFn -> 0xxx
user2 -> HashFn -> 1xxx

SLIDE 24

THE NEXT BIT

String -> HashFn -> 00xx (25% probability)
String -> HashFn -> 10xx (25% probability)
String -> HashFn -> 01xx (25% probability)
String -> HashFn -> 11xx (25% probability)

SLIDE 25

HASHING 4 STRINGS

user1 -> HashFn -> 00xx
user2 -> HashFn -> 10xx
user3 -> HashFn -> 01xx
user4 -> HashFn -> 11xx

SLIDE 26

HYPERLOGLOG

What about 001x?
If we hashed one string, there is a 12.5% chance this could occur
If we hashed 8 strings, we would expect about one of them to have this prefix
What about 000001…x?
Extremely unlikely to occur if we only hashed one string
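
The prefix probabilities discussed on the last few slides (50% for `0`, 25% for `00`, 12.5% for `001`) can be checked empirically. The sketch below is an illustration only: SHA-256 stands in for the unspecified hash function, and the usernames are synthetic.

```python
import hashlib


def hash_prefix(value: str, n_bits: int) -> str:
    """First n_bits of a SHA-256 hash, as a string of '0'/'1' characters."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    as_int = int.from_bytes(digest, "big")
    return format(as_int >> (256 - n_bits), f"0{n_bits}b")


def share_with_prefix(values, prefix: str) -> float:
    """Fraction of values whose hash starts with the given bit prefix."""
    hits = sum(1 for v in values if hash_prefix(v, len(prefix)) == prefix)
    return hits / len(values)


# Synthetic usernames, made up for this experiment.
users = [f"user{i}" for i in range(20000)]

for prefix in ("0", "00", "001"):
    print(prefix, round(share_with_prefix(users, prefix), 3))
```

With a well-behaved hash function the printed shares should land close to 0.5, 0.25 and 0.125, so seeing a long run of leading zeros suggests that many distinct values were hashed.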

SLIDE 27

HYPERLOGLOG

Looks at the distribution of bits of hashed values
Cares about the position of the leftmost ‘1’ bit
1000 -> position == 1
0100 -> position == 2
0011 -> position == 3

SLIDE 28

HYPERLOGLOG

Stores the max position of the left-most ‘1’ bit of hashed values
User1 —> hash —> 1000 (position == 1)
User2 —> hash —> 0100 (position == 2)
User3 —> hash —> 0011 (position == 3)
HLL will store position == 3
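
A minimal sketch of this bookkeeping, using the 4-bit hash values from the slide. The function name and the handling of an all-zero hash are assumptions made for the example.

```python
def leftmost_one_position(hashed: int, width: int) -> int:
    """1-based position of the first '1' bit, reading a width-bit value
    from the left. Assumes 0 <= hashed < 2**width."""
    if hashed == 0:
        # No '1' bit at all; returning width + 1 is a convention
        # chosen for this sketch.
        return width + 1
    return width - hashed.bit_length() + 1


# Hash values from the slide, written as 4-bit binary literals.
hashes = {"User1": 0b1000, "User2": 0b0100, "User3": 0b0011}

positions = {user: leftmost_one_position(h, width=4) for user, h in hashes.items()}
stored = max(positions.values())

print(positions)
print("stored position:", stored)
```

Only the single number `stored` needs to be kept, no matter how many values are hashed.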

SLIDE 29

HYPERLOGLOG

SLIDE 30

HYPERLOGLOG ACCURACY

String -> HashFn -> 00xx
String -> HashFn -> 10xx
String -> HashFn -> 01xx
String -> HashFn -> 11xx

25% Probability

SLIDE 31

HYPERLOGLOG

If we fed the stream through a second hash function, we’d have a second independent estimate
Adding more hash functions gives us more independent estimates that we can combine together for a lower variance estimate
This is expensive because we have to hash the same data n times

SLIDE 32

HYPERLOGLOG

Instead we can split the stream
Estimate the cardinality of each sub-stream
For each sub-stream:
Store the maximum over the positions of the leftmost '1' bit for hashed values of the sub-stream
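
The slides do not say how a value is assigned to a sub-stream. A common choice, assumed in the sketch below, is to let the first few bits of the hash pick the bucket and to look for the leftmost '1' in the remaining bits. SHA-256 truncated to 64 bits stands in for the hash function, and all names are hypothetical.

```python
import hashlib

BUCKET_BITS = 2                  # 2 bits -> 4 buckets (an assumption)
NUM_BUCKETS = 1 << BUCKET_BITS
HASH_BITS = 64
REST_BITS = HASH_BITS - BUCKET_BITS


def hash64(value: str) -> int:
    """64-bit hash taken from the first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


def add(buckets: list, value: str) -> None:
    """Update the bucket for this value's sub-stream in place."""
    h = hash64(value)
    index = h >> REST_BITS                      # leading bits choose the sub-stream
    rest = h & ((1 << REST_BITS) - 1)           # remaining bits of the hash
    position = REST_BITS - rest.bit_length() + 1  # leftmost '1' in the remainder
    buckets[index] = max(buckets[index], position)


# 0 means "nothing seen yet" in this sketch.
buckets = [0] * NUM_BUCKETS
for user in ["user1", "user4", "user12", "user7", "user6", "user1"]:
    add(buckets, user)

print(buckets)
```

Each value is hashed once, and adding the same value again never changes a bucket, so duplicates in the stream do not inflate the result.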

SLIDE 33

HYPERLOGLOG

Buckets: [INF, INF, INF, INF]

SLIDE 34

HYPERLOGLOG

user1 -> HashFn -> 01xxx...x

Buckets: [2, INF, INF, INF]

SLIDE 35

HYPERLOGLOG

user1 -> HashFn -> 01xxx...x
user4 -> HashFn -> 01xxx...x
user12 -> HashFn -> 01xxx...x
user7 -> HashFn -> 1xxxx...x

Buckets: [2, 2, 2, 1]

SLIDE 36

HYPERLOGLOG

user6 -> HashFn -> 001xx...x

Buckets: [2 -> 3, 2, 2, 1]

SLIDE 37

DETERMINING FINAL CARDINALITY

Buckets: [3, 2, 2, 1] -> MATH -> 11.00
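
The slide leaves the step from buckets to a number as "MATH". In the published HyperLogLog algorithm (Flajolet, Fusy, Gandouet and Meunier), the raw estimate is a normalized harmonic mean over the m buckets, where M_j is the value stored in bucket j and α_m is a bias-correction constant that depends on m:

```latex
E = \alpha_m \, m^{2} \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}
```

For the buckets [3, 2, 2, 1] this gives m = 4 and a sum of 2^{-3} + 2^{-2} + 2^{-2} + 2^{-1} = 1.125, so E = α_4 · 16 / 1.125 ≈ 14.22 · α_4. The slides do not state which constant or corrections were used to arrive at 11.00, so this is a sketch of the general formula rather than a reproduction of that figure.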

SLIDE 38

HYPERLOGLOG

Timestamp      Buckets
2013-10-28T02  [3, 2, 2, 1]
2013-10-28T03  [1, 2, 1, 2]
2013-10-28T04  [2, 1, 4, 1]
2013-10-28T05  [2, 2, 3, 1]

SLIDE 39

HYPERLOGLOG

Timestamp   HLL Object
2013-10-28  [3, 2, 4, 2]
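
The daily object follows from the hourly ones by taking the maximum of each bucket across hours. A minimal sketch, with the hourly bucket values copied from the slides and a hypothetical `merge` helper:

```python
from functools import reduce

# Hourly HyperLogLog buckets copied from the slides.
HOURLY = {
    "2013-10-28T02": [3, 2, 2, 1],
    "2013-10-28T03": [1, 2, 1, 2],
    "2013-10-28T04": [2, 1, 4, 1],
    "2013-10-28T05": [2, 2, 3, 1],
}


def merge(a, b):
    """Combine two bucket arrays of equal length by element-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


daily = reduce(merge, HOURLY.values())
print(daily)
```

Because `max` is associative and commutative, the hours can be merged in any order or grouping, which is what makes these objects cheap to combine at query time.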

SLIDE 40

SLIDE 41

RESULTS

SLIDE 42

CASE STUDY 2

SLIDE 43

CASE STUDY 2

Problem: determine the distribution of values
Use case: quantiles and histograms
Hourly truncation

SLIDE 44

THE DATA

Timestamp             Bid Price
2013-10-28T02:13:43Z  1.19
2013-10-28T02:14:21Z  0.05
2013-10-28T02:55:32Z  1.04
2013-10-28T03:07:28Z  0.16
2013-10-28T03:13:43Z  1.03
2013-10-28T04:18:19Z  0.15
2013-10-28T05:36:34Z  0.01
2013-10-28T05:37:59Z  1.03

SLIDE 45

EXACT SOLUTION

Timestamp             Bid Price
2013-10-28T02:13:43Z  1.19
2013-10-28T02:14:21Z  0.05
2013-10-28T02:55:32Z  1.04
2013-10-28T03:07:28Z  0.16
2013-10-28T03:13:43Z  1.03
2013-10-28T04:18:19Z  0.15
2013-10-28T05:36:34Z  0.01
2013-10-28T05:37:59Z  1.03

Timestamp      Bid Prices
2013-10-28T02  [1.19, 0.05, 1.04]
2013-10-28T03  [0.16, 1.03]
2013-10-28T04  [0.15]
2013-10-28T05  [0.01, 1.03]

SLIDE 46

EXACT SOLUTION

Timestamp      Bid Prices
2013-10-28T02  [1.19, 0.05, 1.04]
2013-10-28T03  [0.16, 1.03]
2013-10-28T04  [0.15]
2013-10-28T05  [0.01, 1.03]

Timestamp   Bid Prices
2013-10-28  [1.19, 0.05, 1.04, 0.16, 1.03, 0.15, 0.01, 1.03]
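
For the exact approach, combining hours means concatenating the stored arrays, after which any quantile can be computed directly. A small sketch using the values from the slides; the choice of the median as the example quantile is ours.

```python
import statistics

# Hourly arrays of raw bid prices, copied from the slides.
HOURLY_PRICES = {
    "2013-10-28T02": [1.19, 0.05, 1.04],
    "2013-10-28T03": [0.16, 1.03],
    "2013-10-28T04": [0.15],
    "2013-10-28T05": [0.01, 1.03],
}

# Combining hours is plain concatenation: every raw value is kept.
daily_prices = [p for prices in HOURLY_PRICES.values() for p in prices]

# With all values available, the median (or any other quantile) is exact.
median = statistics.median(daily_prices)

print(len(daily_prices), "values, median =", round(median, 3))
```

The stored array grows with every event, which is the cost the approximate representation is meant to avoid.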

SLIDE 47

EXACT SOLUTION

Arrays of values
Storage: Linear
Computation: Linear
Accuracy: 100%
Problem: Storing raw values can often be more expensive than storing the rest of the row.
Solution: Store an approximate representation!

SLIDE 48

APPROXIMATE HISTOGRAMS

“A Streaming Parallel Decision Tree Algorithm”
Yael Ben-Haim & Elad Tom-Tov
Storage: Sublinear/Linear
Computation: Sublinear/Linear
Accuracy: pretty good

SLIDE 49

RAW DATA

40 Prices: 3.46, 5.37, 5.62, 5.87, 6.21, 6.79, 7.11, 7.36, 7.55, 7.64, 7.89, 7.9, 8.07, 8.44, 8.62, 8.78, 8.87, 9.03, 9.24, 9.36, 9.58, 9.59, 9.81, 10.31, 10.35, 10.39, 10.47, 10.77, 10.93, 11.04, 11.1, 13.1, 13.27, 13.29, 13.87, 14.29, 14.51, 14.9, 15.75, 17.07

SLIDE 50

RAW DATA

SLIDE 51

SUMMARIZE WITH (COUNT, MEAN)

SLIDE 52

SUMMARIZE WITH (COUNT, MEAN)

SLIDE 53

SUMMARIZE WITH (COUNT, MEAN)
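
The pictures for these slides did not survive extraction. As a stand-in, here is a simplified sketch in the spirit of the update step from the Ben-Haim and Tom-Tov paper: every value enters as a (mean, count) pair with count 1, and whenever there are more than a fixed number of bins, the two bins with the closest means are merged into their weighted average. The bin limit of 5 and the function name are assumptions for illustration, and this is not Druid's ApproximateHistogram code.

```python
# The 40 prices from the RAW DATA slide.
PRICES = [
    3.46, 5.37, 5.62, 5.87, 6.21, 6.79, 7.11, 7.36, 7.55, 7.64,
    7.89, 7.9, 8.07, 8.44, 8.62, 8.78, 8.87, 9.03, 9.24, 9.36,
    9.58, 9.59, 9.81, 10.31, 10.35, 10.39, 10.47, 10.77, 10.93, 11.04,
    11.1, 13.1, 13.27, 13.29, 13.87, 14.29, 14.51, 14.9, 15.75, 17.07,
]


def add_value(bins, value, max_bins):
    """Insert one value into a list of [mean, count] bins, in place."""
    bins.append([value, 1])
    bins.sort(key=lambda b: b[0])
    if len(bins) > max_bins:
        # Find the adjacent pair of bins whose means are closest.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (m1, c1), (m2, c2) = bins[i], bins[i + 1]
        # Replace the pair with one bin at the count-weighted mean.
        bins[i:i + 2] = [[(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]]
    return bins


histogram = []
for price in PRICES:
    add_value(histogram, price, max_bins=5)  # 5 bins is an arbitrary choice

for mean, count in histogram:
    print(f"count={count:2d}  mean={mean:.2f}")
```

Merging by weighted mean keeps the total count and the overall mean exact; only the shape of the distribution within each bin is lost.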

SLIDE 54

COMBINING HISTOGRAMS

SLIDE 55

COMBINING HISTOGRAMS
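
Two (count, mean) histograms can be combined in the same spirit as the merge step described by Ben-Haim and Tom-Tov: pool the bins, then repeatedly merge the closest pair until the bin limit is met. The sketch below is a simplified illustration; the two input histograms are made-up values, not data from the talk.

```python
def merge_histograms(a, b, max_bins):
    """Combine two lists of (mean, count) bins into at most max_bins bins."""
    bins = sorted(a + b)
    while len(bins) > max_bins:
        # Adjacent pair with the smallest gap between means.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (m1, c1), (m2, c2) = bins[i], bins[i + 1]
        bins[i:i + 2] = [((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2)]
    return bins


# Hypothetical hourly histograms, invented for this example.
left = [(5.0, 3), (8.0, 10), (10.0, 7)]
right = [(7.5, 4), (11.0, 6), (14.0, 10)]

merged = merge_histograms(left, right, max_bins=3)
for mean, count in merged:
    print(f"count={count:2d}  mean={mean:.2f}")
```

The merged object has the same bounded size as its inputs, so hourly histograms can be rolled up to daily ones without the storage growing.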

SLIDE 56

SLIDE 57

COUNT # <= X
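
One way to estimate the number of values at or below x from (mean, count) bins is the interpolation used in the Ben-Haim and Tom-Tov paper: treat half of each bin's count as lying on either side of its mean, and approximate the count between two neighbouring means with a trapezoid. The sketch below is a simplified version under that assumption. In particular, it returns 0 below the first mean and the full total at or above the last mean, which is cruder than the paper's treatment of the tails. The example histogram is hypothetical.

```python
def count_at_most(bins, x):
    """Estimate how many values are <= x.

    bins: list of (mean, count) pairs sorted by mean, with distinct means.
    """
    total = float(sum(c for _, c in bins))
    if x < bins[0][0]:
        return 0.0          # simplification for the left tail
    if x >= bins[-1][0]:
        return total        # simplification for the right tail
    seen = 0.0              # full counts of bins entirely to the left
    for (p1, m1), (p2, m2) in zip(bins, bins[1:]):
        if p1 <= x < p2:
            frac = (x - p1) / (p2 - p1)
            m_x = m1 + (m2 - m1) * frac            # interpolated height at x
            return seen + m1 / 2 + (m1 + m_x) / 2 * frac
        seen += m1
    return total


# Hypothetical histogram, invented for this example.
histogram = [(5.0, 3), (9.0, 27), (14.0, 10)]
print(count_at_most(histogram, 10.0))
```

Dividing the result by the total count gives an approximate CDF, from which quantiles can be read off by searching for the x that yields the desired fraction.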

SLIDE 58

SLIDE 59

ACCURACY

SLIDE 60

DRUID

Open source
Designed to power interactive applications at scale
Optimized for business intelligence (OLAP) queries
Arbitrary slice-n-dice and drill into data
Supports streaming and batch data ingestion
Exact and approximate calculations (HyperLogLog, approximate histograms)

SLIDE 61

BENCHMARKS

100 cc2.8xlarge (1600 cores, 6TB RAM) Druid cluster
27B summarized rows/s scan rate
Add 16B summarized (~640B raw) rows/s
Combine 4B HyperLogLog objects/s
Combine 1.5B ApproximateHistogram objects/s

SLIDE 62

CONCLUSIONS

Summarization for sums: substantially (e.g. ~40x for us) faster/less storage
100% accuracy
Sketches for cardinality/distribution: 1-2 orders of magnitude faster/less storage than raw
97% accuracy
40x lower cost is make or break
Interactive queries that are accurate enough

SLIDE 63

DRUID IS OPEN SOURCE WWW.DRUID.IO

Twitter: @druidio
IRC: irc.freenode.net #druid-dev

SLIDE 64

THANK YOU