SLIDE 1

10/1/2020

Sublinear Algorithms

LECTURE 9

Last time

  • Approximate counting
  • Estimation of the 2nd moment
  • Linear sketching

Today

  • Multipurpose sketches
  • Count-min and count-sketch
  • Range queries, heavy hitters, quantiles

Sofya Raskhodnikova; Boston University

SLIDE 2

Multipurpose Sketches: Problems

Input: a stream $a_1, a_2, \dots, a_m \in [n]^m$

  • The frequency vector of the stream is $f = (f_1, \dots, f_n)$, where $f_i$ is the number of times $i$ appears in the stream

Goal: to maintain data structures that can answer the following queries

  • Point Query: For $i \in [n]$, estimate $f_i$
  • Range Query: For $i, j \in [n]$, estimate $f_i + f_{i+1} + \dots + f_j$
  • Quantile Query: For $\phi \in [0, 1]$, find $j$ with $f_1 + \dots + f_j \approx \phi m$
  • Heavy Hitters Query: For $\phi \in [0, 1]$, find all $i$ with $f_i \ge \phi m$

Desired accuracy: $\pm \varepsilon m$ with error probability $\delta$

Based on Andrew McGregor’s slides: https://people.cs.umass.edu/~mcgregor/711S18/vectors-3.pdf
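As a point of reference for the four query types, here is a minimal exact baseline in Python. It is only a sketch for illustration: the class name `ExactFrequencies` and its method names are hypothetical, and it stores the whole frequency vector, which is exactly the linear-space cost the sketches in this lecture try to avoid.

```python
from collections import Counter
from itertools import accumulate


class ExactFrequencies:
    """Stores the full frequency vector f_1..f_n: exact answers, Theta(n) counters."""

    def __init__(self, stream, n):
        counts = Counter(stream)
        self.n = n
        self.m = len(stream)
        # f[i] = f_i for i in 1..n; f[0] is a placeholder so indices match the slides.
        self.f = [0] + [counts[i] for i in range(1, n + 1)]
        # prefix[j] = f_1 + ... + f_j, with prefix[0] = 0.
        self.prefix = list(accumulate(self.f))

    def point(self, i):
        return self.f[i]

    def range_sum(self, i, j):
        return self.prefix[j] - self.prefix[i - 1]

    def quantile(self, phi):
        # Smallest j whose prefix sum reaches phi * m.
        return next(j for j in range(1, self.n + 1) if self.prefix[j] >= phi * self.m)

    def heavy_hitters(self, phi):
        return [i for i in range(1, self.n + 1) if self.f[i] >= phi * self.m]


stream = [3, 1, 5, 1, 1, 3, 7, 2, 4, 4, 6, 7, 1]
exact = ExactFrequencies(stream, n=8)
```

Every approximate structure below can be compared against these exact answers on small inputs.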

SLIDE 3

Initial Solution to Point Queries

  • We could maintain the whole frequency vector $f_1, \dots, f_n$
  • Then, on query $i$, we can output $f_i$

Idea: Group counts for some numbers together. If $i$ falls into bucket $j$, then $f_i \le c_j$, so the estimate never underestimates.

[Figure: the values 3 1 5 1 1 3 7 2 4 4 6 7 1 grouped into buckets with counters $c_1, c_2, c_3, \dots, c_b$]

Point Query Algorithm (initial version)
1. Sample a hash function $h : [n] \to [b]$ from a 2-wise independent family
2. Initialize counters $c_1, \dots, c_b$ to 0
3. For each element $a$, increment $c_{h(a)}$ by 1.
4. To answer a point query $i$, return $c_{h(i)}$.
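The four steps of the initial algorithm can be sketched in Python as follows. This is an illustration under assumptions, not the lecture's code: the class name `SingleHashSketch`, the `seed` parameter, and the hash $h(x) = ((\alpha x + \beta) \bmod P) \bmod b$ with a fixed Mersenne prime $P$ are choices made here, and that hash is only approximately 2-wise independent after the final reduction mod $b$.

```python
import random


class SingleHashSketch:
    """One row of b counters; point queries never underestimate."""

    PRIME = (1 << 61) - 1  # Mersenne prime, assumed larger than the universe size n

    def __init__(self, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        # Step 1: sample h(x) = ((alpha * x + beta) mod PRIME) mod b.
        self.alpha = rng.randrange(1, self.PRIME)
        self.beta = rng.randrange(self.PRIME)
        # Step 2: initialize the b counters to 0.
        self.counters = [0] * b

    def _h(self, x):
        return ((self.alpha * x + self.beta) % self.PRIME) % self.b

    def update(self, a):
        # Step 3: increment the counter of the bucket that a hashes to.
        self.counters[self._h(a)] += 1

    def query(self, i):
        # Step 4: the bucket counter is an upper bound on f_i.
        return self.counters[self._h(i)]


stream = [3, 1, 5, 1, 1, 3, 7, 2, 4, 4, 6, 7, 1]
sketch = SingleHashSketch(b=4, seed=1)
for a in stream:
    sketch.update(a)
```

Whatever hash function is sampled, `sketch.query(i)` is at least the true count of `i`, because every occurrence of `i` increments the same counter.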

SLIDE 4

Initial Solution to Point Queries: Analysis

  • Fix $i^* \in [n]$.
  • Let $Z = c_{h(i^*)} - f_{i^*}$ be the overestimation error.
  • For all $i \ne i^*$, let $X_i = 1$ if $h(i) = h(i^*)$ and $X_i = 0$ otherwise. Then
      $Z = \sum_{i \ne i^*} X_i \cdot f_i$
  • By 2-wise independence,
      $\mathbb{E}[X_i] = \Pr[h(i) = h(i^*)] = \frac{1}{b}$
  • By linearity of expectation,
      $\mathbb{E}[Z] = \sum_{i \ne i^*} \mathbb{E}[X_i] \cdot f_i = \frac{1}{b} \sum_{i \ne i^*} f_i \le \frac{m}{b}$
  • By Markov’s inequality, if $b = 2/\varepsilon$ then
      $\Pr[Z \ge \varepsilon m] \le \frac{\mathbb{E}[Z]}{\varepsilon m} \le \frac{1}{\varepsilon b} \le \frac{1}{2}$
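The collision probability $1/b$ used in the analysis can be checked exactly on a toy family. The sketch below assumes the family $h(x) = (\alpha x + \beta) \bmod p$ with $\alpha, \beta$ uniform in $\{0, \dots, p-1\}$ and takes the number of buckets equal to the prime $p$; these choices, and the name `collision_probability`, are illustrative and not from the slides.

```python
from fractions import Fraction
from itertools import product

p = 7  # small prime; here it is both the universe size and the number of buckets


def collision_probability(x, y):
    """Exact Pr[h(x) == h(y)] over all p * p choices of (alpha, beta)."""
    hits = sum(
        1
        for alpha, beta in product(range(p), repeat=2)
        if (alpha * x + beta) % p == (alpha * y + beta) % p
    )
    return Fraction(hits, p * p)


# Collect the collision probability of every pair of distinct keys.
probs = {collision_probability(x, y) for x in range(p) for y in range(p) if x != y}
```

Two distinct keys collide only when `alpha` is 0, which happens for `p` of the `p * p` parameter pairs, so every pair has collision probability exactly $1/p$.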

SLIDE 5

Count-Min Sketch [Cormode Muthukrishnan 03]

Point Query Algorithm
1. Set $t = \log_2(1/\delta)$ and $b = 2/\varepsilon$
2. Sample $t$ hash functions $h_j : [n] \to [b]$ from a 2-wise independent family
3. Initialize $tb$ counters $c_{j,k}$ to 0
4. For each element $a$ and each $j \in [t]$, increment $c_{j,h_j(a)}$ by 1.
5. To answer a point query $i$, return $\tilde{f}_i = \min_{j \in [t]} c_{j,h_j(i)}$.

  • Correctness: we never underestimate, and the hash functions are chosen independently, so
      $\Pr[f_i \le \tilde{f}_i \le f_i + \varepsilon m] = 1 - \Pr[\text{all } t \text{ hash functions overestimate by more than } \varepsilon m] \ge 1 - \left(\frac{1}{2}\right)^t = 1 - \delta$
  • Space: $O(t(\log n + \log b))$ for the hash functions + $O(tb \log m)$ for the counters
      Total: $O\left((\log n + \log m) \cdot \frac{1}{\varepsilon} \log \frac{1}{\delta}\right)$
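A compact Python sketch of the Count-Min point query algorithm is given below. The class name `CountMin`, the `seed` parameter, the rounding of $t$ and $b$ up to integers, and the concrete hash $((\alpha x + \beta) \bmod P) \bmod b$ are assumptions made for illustration; that hash is only approximately 2-wise independent.

```python
import math
import random


class CountMin:
    PRIME = (1 << 61) - 1  # Mersenne prime, assumed larger than the universe size n

    def __init__(self, eps, delta, seed=0):
        rng = random.Random(seed)
        self.t = max(1, math.ceil(math.log2(1 / delta)))  # number of hash functions
        self.b = math.ceil(2 / eps)                        # counters per hash function
        # One independently sampled (alpha, beta) pair per row.
        self.hashes = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(self.t)
        ]
        self.rows = [[0] * self.b for _ in range(self.t)]

    def _h(self, j, x):
        alpha, beta = self.hashes[j]
        return ((alpha * x + beta) % self.PRIME) % self.b

    def update(self, a, count=1):
        # Increment one counter in every row.
        for j in range(self.t):
            self.rows[j][self._h(j, a)] += count

    def query(self, i):
        # Every row overestimates, so the minimum is the tightest upper bound.
        return min(self.rows[j][self._h(j, i)] for j in range(self.t))


rng = random.Random(7)
stream = [rng.randrange(1, 101) for _ in range(1000)]
cm = CountMin(eps=0.25, delta=1 / 16, seed=3)
for a in stream:
    cm.update(a)
```

With `eps=0.25` and `delta=1/16` this allocates 4 rows of 8 counters each, regardless of the stream length.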

SLIDE 6

Multipurpose Sketches: Problems

The input, queries, and accuracy requirement are as stated on Slide 2. We now turn to range queries:

  • Range Query: For $i, j \in [n]$, estimate $f_i + f_{i+1} + \dots + f_j$. Denote this sum by $f_{[i,j]}$.

SLIDE 7

Range Queries

  • We could estimate $f_{[i,j]}$ by $\tilde{f}_i + \tilde{f}_{i+1} + \dots + \tilde{f}_j$

But errors add up: we would need too much space to keep each estimate accurate enough.

Idea: We could estimate counts for some intervals directly by grouping $i, \dots, j$

How many intervals do we need to maintain so that every interval $[i, j]$ is a sum of $O(\log n)$ maintained intervals?

[Figure: illustration over the points 1 through 8]

SLIDE 8

Dyadic Intervals

[Figure: tree of dyadic intervals with $\lg n + 1$ levels]
  Root: $[n]$
  Next level: $[1, n/2]$, $[n/2 + 1, n]$
  Next level: $[1, n/4]$, $[n/4 + 1, n/2]$, $[n/2 + 1, 3n/4]$, $[3n/4 + 1, n]$
  ...
  Leaves: $1, 2, \dots, n - 1, n$

  • Exercise: Each interval $[i, j]$ is a disjoint union of at most $2 \lg n$ dyadic intervals.
  • Such a representation of an interval is its dyadic decomposition.
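One way to compute a dyadic decomposition is a greedy left-to-right scan, sketched below in Python. The function name `dyadic_decomposition` is hypothetical, and the code assumes $n$ is a power of two and that intervals are 1-indexed and inclusive, as on the slide.

```python
def dyadic_decomposition(i, j, n):
    """Split [i, j] (1-indexed, inclusive) into disjoint dyadic intervals of [1, n].

    Assumes n is a power of two. Returns a list of (lo, hi) pairs, left to right.
    """
    assert 1 <= i <= j <= n and n & (n - 1) == 0
    pieces = []
    lo, hi = i - 1, j  # switch to the 0-indexed half-open range [lo, hi)
    while lo < hi:
        # Largest power of two that divides lo (n itself when lo == 0) ...
        size = lo & -lo if lo else n
        # ... shrunk until the block starting at lo fits inside [lo, hi).
        while size > hi - lo:
            size //= 2
        pieces.append((lo + 1, lo + size))
        lo += size
    return pieces


example = dyadic_decomposition(2, 7, 8)
```

For $n = 8$, the interval $[2, 7]$ splits into $[2, 2]$, $[3, 4]$, $[5, 6]$, $[7, 7]$: four pieces, within the bound $2 \lg 8 = 6$.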

SLIDE 9

Count-Min Strikes Back

Range Query Algorithm
1. Construct $\lg n + 1$ Count-Min sketches, one for each level, such that for all intervals $I$ at that level, our estimate $\tilde{f}_I$ for $f_I$ satisfies
     $\Pr[f_I \le \tilde{f}_I \le f_I + \varepsilon m] \ge 1 - \delta$
2. To answer a range query $[i, j]$, let $I_1, \dots, I_k$ be its dyadic decomposition. Return $\tilde{f}_{[i,j]} = \tilde{f}_{I_1} + \dots + \tilde{f}_{I_k}$

  • Correctness: by a union bound over the at most $2 \lg n$ intervals in the decomposition,
      $\Pr[f_{[i,j]} \le \tilde{f}_{[i,j]} \le f_{[i,j]} + \varepsilon m (2 \lg n)] \ge 1 - \delta (2 \lg n)$
  • Space: Multiply the old space complexity by $\log n$ and divide $\varepsilon$ and $\delta$ by $\log n$:
      $O\left(\log^2 n \cdot (\log n + \log m) \cdot \frac{1}{\varepsilon} \log \frac{\log n}{\delta}\right)$
  • Quantile Query: For $\phi \in [0, 1]$ find $j$ with $f_{[1,j]} \approx \phi m$
      Approximate Median: Find $j$ such that $f_{[1,j]} \ge \frac{m}{2} - \varepsilon m$ and $f_{[1,j-1]} \le \frac{m}{2} + \varepsilon m$
      We can approximate the median via binary search over range queries.
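The range query algorithm and the binary search for an approximate median can be sketched as follows. Everything named here (`CountMin`, `DyadicCountMin`, `range_query`, `approx_median`, the `seed` parameter) is hypothetical, the universe is taken to be $\{0, \dots, n-1\}$ with $n$ a power of two for simpler bit arithmetic, and the hash $((\alpha x + \beta) \bmod P) \bmod b$ is only approximately 2-wise independent.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime, assumed larger than n


class CountMin:
    def __init__(self, t, b, rng):
        self.b = b
        self.hashes = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(t)]
        self.rows = [[0] * b for _ in range(t)]

    def _cells(self, x):
        # Yield (row, bucket index) for key x in every row.
        for (alpha, beta), row in zip(self.hashes, self.rows):
            yield row, ((alpha * x + beta) % PRIME) % self.b

    def update(self, x):
        for row, k in self._cells(x):
            row[k] += 1

    def query(self, x):
        return min(row[k] for row, k in self._cells(x))


class DyadicCountMin:
    """One Count-Min sketch per level of the dyadic tree over {0, ..., n-1}."""

    def __init__(self, n, t, b, seed=0):
        assert n & (n - 1) == 0, "n is assumed to be a power of two"
        rng = random.Random(seed)
        self.n = n
        self.levels = n.bit_length()  # lg n + 1 levels; level 0 holds the leaves
        self.sketches = [CountMin(t, b, rng) for _ in range(self.levels)]

    def update(self, x):
        for level, sketch in enumerate(self.sketches):
            sketch.update(x >> level)  # index of the interval at this level containing x

    def range_query(self, lo, hi):
        """Estimate the number of stream elements in [lo, hi] (inclusive)."""
        total, hi = 0, hi + 1  # work with the half-open range [lo, hi)
        while lo < hi:
            # Highest level whose interval starts at lo and fits in [lo, hi).
            level = (lo & -lo).bit_length() - 1 if lo else self.levels - 1
            while (1 << level) > hi - lo:
                level -= 1
            total += self.sketches[level].query(lo >> level)
            lo += 1 << level
        return total

    def approx_median(self, m):
        """Binary search for a j whose estimated prefix count reaches m / 2."""
        lo, hi = 0, self.n - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.range_query(0, mid) >= m / 2:
                hi = mid
            else:
                lo = mid + 1
        return lo


rng = random.Random(11)
stream = [rng.randrange(64) for _ in range(500)]
dcm = DyadicCountMin(n=64, t=4, b=32, seed=5)
for x in stream:
    dcm.update(x)
estimate = dcm.range_query(10, 40)
median = dcm.approx_median(len(stream))
```

Because every Count-Min estimate is an upper bound, `range_query` never underestimates, and the binary search never returns an index larger than the true median. Estimated prefix counts need not be monotone, so the search is a heuristic use of the range queries rather than a guaranteed exact procedure.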

SLIDE 10

Count-Min Strikes Back (Part 2)

Heavy Hitters Query: For $\phi \in (\varepsilon, 1)$, find a set $S$ that
  – includes all $i$ with $f_i \ge \phi m$
  – excludes all $j$ with $f_j \le (\phi - \varepsilon) m$

Heavy Hitters Algorithm
1. Construct $\lg n + 1$ Count-Min sketches for levels of the dyadic tree, as before
2. To answer query $\phi$, mark the root. Going level-by-level from the root, mark children $I$ of marked nodes if $\tilde{f}_I \ge \phi m$
3. Return all marked leaves

Correctness: If $f_i \ge \phi m$, then for all ancestors $I$ of the leaf $i$, $\tilde{f}_I \ge f_I \ge \phi m$

  • If we ensure that $\Pr[\text{point query overestimates by} > \varepsilon m] \le \delta / n$, then, by union bound, all point queries are correct w.p. $\ge 1 - \delta$
  • There are at most $1/\phi$ indices $i$ with $f_i \ge \phi m$. Thus, $O(\phi^{-1} \log n)$ time suffices for post-processing
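The top-down marking procedure can be sketched in Python as below. The class name `DyadicHeavyHitters`, the helper `make_row_hashes`, the `seed` parameter, and the 0-indexed universe $\{0, \dots, n-1\}$ with $n$ a power of two are assumptions for illustration, as is the hash $((\alpha x + \beta) \bmod P) \bmod b$.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime, assumed larger than n


def make_row_hashes(t, rng):
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(t)]


class DyadicHeavyHitters:
    def __init__(self, n, t, b, seed=0):
        assert n & (n - 1) == 0, "n is assumed to be a power of two"
        rng = random.Random(seed)
        self.n, self.b = n, b
        self.levels = n.bit_length()  # level 0 = leaves, top level = root
        # One Count-Min table (t rows of b counters) per level.
        self.hashes = [make_row_hashes(t, rng) for _ in range(self.levels)]
        self.tables = [[[0] * b for _ in range(t)] for _ in range(self.levels)]
        self.m = 0

    def _slot(self, alpha, beta, x):
        return ((alpha * x + beta) % PRIME) % self.b

    def update(self, x):
        self.m += 1
        for level in range(self.levels):
            node = x >> level  # ancestor of leaf x at this level
            for (alpha, beta), row in zip(self.hashes[level], self.tables[level]):
                row[self._slot(alpha, beta, node)] += 1

    def estimate(self, level, node):
        return min(
            row[self._slot(alpha, beta, node)]
            for (alpha, beta), row in zip(self.hashes[level], self.tables[level])
        )

    def heavy_hitters(self, phi):
        marked = [0]  # the root is node 0 at the top level
        for level in range(self.levels - 2, -1, -1):
            children = [c for node in marked for c in (2 * node, 2 * node + 1)]
            marked = [c for c in children if self.estimate(level, c) >= phi * self.m]
        return marked  # marked leaves


stream = [5] * 60 + list(range(16)) * 2 + [9] * 8
hh = DyadicHeavyHitters(n=16, t=4, b=8, seed=2)
for x in stream:
    hh.update(x)
found = hh.heavy_hitters(phi=0.5)
```

Element 5 accounts for 62 of the 100 stream items, so it is always reported for `phi=0.5`: none of its ancestors can be underestimated. Other leaves may appear as false positives when hash collisions inflate their estimates.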

SLIDE 11

CR-Precis: Deterministic Count-Min [Ganguly Majumder 07]

Use deterministic hash functions: $h_j(a) = a \bmod p_j$, where $p_j$ is the $j$-th prime, for $j \in [t]$

Analysis: Fix $i^* \in [n]$. Define $z_1, \dots, z_t$ such that $c_{j,h_j(i^*)} = f_{i^*} + z_j$, that is,
  $z_j = \sum_{i \ne i^* : h_j(i) = h_j(i^*)} f_i$

  • Claim (by Chinese Remainder Theorem): For each $i \ne i^*$, we have $h_j(i) = h_j(i^*)$ for at most $\log n$ primes $p_j$
  • Thus,
      $\sum_{j \in [t]} z_j = \sum_{j \in [t]} \sum_{i \ne i^* : h_j(i) = h_j(i^*)} f_i = \sum_{i \ne i^*} \sum_{j : h_j(i) = h_j(i^*)} f_i \le \sum_{i \ne i^*} f_i \log n \le m \log n$
  • Since the minimum is at most the average,
      $\tilde{f}_{i^*} = \min_{j \in [t]} c_{j,h_j(i^*)} = \min_{j \in [t]} (f_{i^*} + z_j) = f_{i^*} + \min_{j \in [t]} z_j \le f_{i^*} + \frac{m \log n}{t}$
  • We set $t = \frac{\log n}{\varepsilon}$ to get $f_i \le \tilde{f}_i \le f_i + \varepsilon m$
  • Requires keeping at most $t \cdot p_t = \tilde{O}\left(\frac{\log^2 n}{\varepsilon^2}\right)$ counters since $p_t = O(t \log t)$
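A small Python sketch of the deterministic construction follows. The names `first_primes` and `CRPrecis` are hypothetical, the logarithm is taken base 2, $t$ is rounded up to an integer, and the primes are found by simple trial division, which is fine only for small $t$.

```python
import math


def first_primes(t):
    """Return the first t primes by trial division (adequate for small t)."""
    primes = []
    candidate = 2
    while len(primes) < t:
        # candidate is prime if no smaller prime up to its square root divides it.
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


class CRPrecis:
    def __init__(self, n, eps):
        self.t = math.ceil(math.log2(n) / eps)
        self.primes = first_primes(self.t)
        # Row j has p_j counters, one per residue modulo p_j.
        self.rows = [[0] * p for p in self.primes]

    def update(self, a):
        for p, row in zip(self.primes, self.rows):
            row[a % p] += 1

    def query(self, i):
        return min(row[i % p] for p, row in zip(self.primes, self.rows))


stream = [3, 1, 5, 1, 1, 3, 7, 2, 4, 4, 6, 7, 1]
sketch = CRPrecis(n=64, eps=0.5)
for a in stream:
    sketch.update(a)
```

No randomness is involved: for every `i` in `[1, 64]` the estimate lies between the true count and the true count plus `eps * len(stream)`, matching the guarantee derived above.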

SLIDE 12

Count-Sketch: Count-Min+AMS combined

Count-Sketch
1. In addition to $h_j : [n] \to [b]$, use hash functions $r_j : [n] \to \{-1, 1\}$
2. Maintain $tb$ counters $c_{j,k} = \sum_{i : h_j(i) = k} r_j(i) f_i$
3. To answer a point query $i$, return $\hat{f}_i = \mathrm{median}\left(r_1(i)\, c_{1,h_1(i)}, \dots, r_t(i)\, c_{t,h_t(i)}\right)$

Here $F_2 = \sum_i f_i^2$ is the second frequency moment.

  • Claim. $\mathbb{E}[r_j(i)\, c_{j,h_j(i)}] = f_i$ and $\operatorname{Var}[r_j(i)\, c_{j,h_j(i)}] \le \frac{F_2}{b}$ for all $j \in [t]$
  • By Chebyshev, for $b = 3/\varepsilon^2$,
      $\Pr\left[|f_i - r_j(i)\, c_{j,h_j(i)}| \ge \varepsilon \sqrt{F_2}\right] \le \frac{F_2}{\varepsilon^2 b F_2} = \frac{1}{3}$
  • By Chernoff, for $t = \Theta(\log 1/\delta)$,
      $\Pr\left[|f_i - \hat{f}_i| \ge \varepsilon \sqrt{F_2}\right] \le \delta$
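The three Count-Sketch steps can be sketched in Python as follows. The class name `CountSketch`, the `seed` parameter, and both hash constructions are assumptions for illustration: the bucket hash is $((\alpha x + \beta) \bmod P) \bmod b$ and the sign hash is the parity of a second such linear function, so both are only approximately 2-wise independent and the signs are very slightly biased.

```python
import random
import statistics

PRIME = (1 << 61) - 1  # Mersenne prime, assumed larger than the universe size n


class CountSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        # Per row: (a1, b1) define the bucket hash h_j, (a2, b2) the sign hash r_j.
        self.params = [
            tuple(rng.randrange(1, PRIME) for _ in range(4)) for _ in range(t)
        ]
        self.rows = [[0] * b for _ in range(t)]

    def _bucket_and_sign(self, j, x):
        a1, b1, a2, b2 = self.params[j]
        bucket = ((a1 * x + b1) % PRIME) % self.b
        sign = 1 if ((a2 * x + b2) % PRIME) % 2 else -1
        return bucket, sign

    def update(self, x, count=1):
        # Add the signed count to one counter per row.
        for j, row in enumerate(self.rows):
            bucket, sign = self._bucket_and_sign(j, x)
            row[bucket] += sign * count

    def query(self, x):
        # Undo the sign in each row, then take the median of the row estimates.
        estimates = []
        for j, row in enumerate(self.rows):
            bucket, sign = self._bucket_and_sign(j, x)
            estimates.append(sign * row[bucket])
        return statistics.median(estimates)


cs = CountSketch(t=5, b=8, seed=4)
for _ in range(10):
    cs.update(42)
```

An odd number of rows is used so that the median is one of the row estimates. Unlike Count-Min, the estimate can fall below the true count, because colliding items may carry the opposite sign.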

SLIDE 13

Count-Sketch: Proof of Claim

  • Claim. $\mathbb{E}[r_j(i)\, c_{j,h_j(i)}] = f_i$ and $\operatorname{Var}[r_j(i)\, c_{j,h_j(i)}] \le \frac{F_2}{b}$ for all $j \in [t]$

Recall Count-Sketch: $\hat{f}_i = \mathrm{median}\left(r_1(i)\, c_{1,h_1(i)}, \dots, r_t(i)\, c_{t,h_t(i)}\right)$

Proof: Fix $i = i^*$ and $j \in [t]$. We omit subscripts $j$.

  • For all $i \ne i^*$, let $X_i = 1$ if $h(i) = h(i^*)$ and $X_i = 0$ otherwise
  • Expectation (by 2-wise independence):
      $\mathbb{E}[r(i^*)\, c_{h(i^*)}] = \mathbb{E}\left[f_{i^*} + \sum_{i \ne i^*} r(i)\, r(i^*)\, X_i f_i\right] = f_{i^*}$
  • Variance (the cross terms have expectation 0, and $\mathbb{E}[X_i^2] = 1/b$):
      $\operatorname{Var}[r(i^*)\, c_{h(i^*)}] \le \mathbb{E}\left[\left(\sum_{i \ne i^*} r(i)\, r(i^*)\, X_i f_i\right)^2\right] = \mathbb{E}\left[\sum_{i \ne i^*} X_i^2 f_i^2 + \sum_{i \ne k;\ i, k \ne i^*} r(i)\, r(k)\, X_i X_k f_i f_k\right] = \frac{1}{b} \sum_{i \ne i^*} f_i^2 \le \frac{F_2}{b}$
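The individual expectations behind the proof can be written out term by term. The derivation below is a sketch under two assumptions that the slide leaves implicit: the signs $r(i)$ are uniform on $\{-1, 1\}$ and 2-wise independent, and $r$ is chosen independently of $h$.

```latex
\begin{align*}
\mathbb{E}\big[r(i)\,r(i^*)\,X_i\big]
  &= \mathbb{E}[r(i)]\,\mathbb{E}[r(i^*)]\,\mathbb{E}[X_i] = 0
  && (i \neq i^*), \\
\mathbb{E}\big[X_i^2\big]
  &= \mathbb{E}[X_i] = \Pr[h(i) = h(i^*)] = \frac{1}{b}
  && (i \neq i^*), \\
\mathbb{E}\big[r(i)\,r(k)\,X_i X_k\big]
  &= \mathbb{E}[r(i)]\,\mathbb{E}[r(k)]\,\mathbb{E}[X_i X_k] = 0
  && (i \neq k), \\
\operatorname{Var}\big[r(i^*)\,c_{h(i^*)}\big]
  &\le \sum_{i \neq i^*} \frac{f_i^2}{b} \le \frac{F_2}{b}.
\end{align*}
```

The first line gives the expectation claim, and the second and third lines give the two sums in the variance bound; the factor $r(i^*)^2 = 1$ disappears from the cross terms.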
